quan-possible / med-text

Classifying medical text.

Apache License 2.0

hallmark of cancer #3

Closed by shaoxiongji 3 years ago

shaoxiongji commented 3 years ago

Use this processed data for HoC classification: https://github.com/cambridgeltl/cancer-hallmark-cnn

It converts multi-label classification into 10 binary classification sets.
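To make the conversion concrete, here is a minimal sketch of turning multi-label documents into one binary dataset per label. The label names, the example documents, and the helper `to_binary_sets` are hypothetical and only for illustration; this is not the repository's actual preprocessing code, and the real data has 10 labels rather than 3.

```python
# Hypothetical label names and documents, for illustration only.
LABELS = ["label_a", "label_b", "label_c"]

docs = [
    ("doc one text", {"label_a", "label_c"}),
    ("doc two text", set()),
    ("doc three text", {"label_b"}),
]


def to_binary_sets(docs, labels):
    """Return {label: [(text, 0 or 1), ...]}, i.e. one binary dataset per label."""
    return {
        label: [(text, int(label in doc_labels)) for text, doc_labels in docs]
        for label in labels
    }


binary_sets = to_binary_sets(docs, LABELS)
for label, examples in binary_sets.items():
    print(label, examples)
```

Every document appears in every binary set; only the 0/1 target changes from set to set.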

quan-possible commented 3 years ago

Is there anything else about it besides being 10 binary classification sets?

shaoxiongji commented 3 years ago

The data folder has four subfolders. doc-10-class should be the one we want. I don't fully understand the others, but the oversampled one should be balanced using an over-sampling technique.
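For reference, plain random over-sampling of a binary set could look like the sketch below. I don't know how the oversampled folder in that repository was actually produced, so this is only an assumption about the general technique; `random_oversample` and the toy examples are hypothetical.

```python
import random


def random_oversample(examples, seed=0):
    """Duplicate randomly chosen examples of the smaller classes until
    every class has as many examples as the largest one.

    examples: list of (text, label) pairs.
    """
    rng = random.Random(seed)
    by_label = {}
    for example in examples:
        by_label.setdefault(example[1], []).append(example)

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the target size.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(balanced)
    return balanced


examples = [("a", 1), ("b", 0), ("c", 0), ("d", 0)]
balanced = random_oversample(examples)
print(balanced)
```

Over-sampling like this should only be applied to the training split, otherwise duplicated examples can leak into evaluation.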

quan-possible commented 3 years ago

But we should just treat everything as a multilabel classification problem, right? No need to do anything else.

shaoxiongji commented 3 years ago

Defining the problem as either multilabel classification or several binary classification tasks is okay. The multilabel setting applies a sigmoid activation to each output logit to get a per-label probability, which is to some extent a form of binary classification for each label.
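A small sketch of why the two views are close: in the multilabel setting each logit goes through its own sigmoid and is thresholded independently of the others, so each output behaves like a separate binary classifier. The logits, the 0.5 threshold, and the function names here are made up for illustration.

```python
import math


def sigmoid(z):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_multilabel(logits, threshold=0.5):
    """Apply an independent sigmoid + threshold to each label's logit."""
    probs = [sigmoid(z) for z in logits]
    return [int(p >= threshold) for p in probs]


# One logit per label (made-up values; the real model would have 10 outputs).
logits = [2.0, -1.0, 0.0, -3.0]
preds = predict_multilabel(logits)
print(preds)
```

Unlike softmax, the probabilities do not have to sum to 1, so a document can get zero, one, or several labels.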

quan-possible commented 3 years ago

I just implemented code for processing the new data and .csv files that I got from it. Please check it out.

shaoxiongji commented 3 years ago

It looks good.

- Which processed set are you using? cancer-hallmark-cnn/data/doc-10-class, or one of the other three?
- Please check that the annotation statistics are consistent with Table 1 in the paper “Cancer Hallmark Text Classification Using Convolutional Neural Networks”.
- Remove \n from the text.
- Why use data_dict as a global variable?
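For the newline and statistics points, something along these lines might be enough. It is a sketch only: `clean_text`, `label_counts`, the rows, and the label names are hypothetical, and it assumes the \n characters are real newlines rather than a literal backslash followed by n.

```python
from collections import Counter


def clean_text(text):
    """Collapse newlines and any other run of whitespace into one space."""
    return " ".join(text.split())


def label_counts(rows):
    """Count how many documents carry each label."""
    counts = Counter()
    for _, labels in rows:
        counts.update(labels)
    return counts


# Made-up rows of (text, labels), for illustration only.
rows = [
    ("first line\nsecond line", ["label_a"]),
    ("no\n\nlabels here", []),
    ("single line", ["label_a", "label_b"]),
]

cleaned = [(clean_text(text), labels) for text, labels in rows]
counts = label_counts(cleaned)
print(cleaned)
print(counts)
```

The per-label counts can then be compared by hand against the numbers reported in the paper's table.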

quan-possible commented 3 years ago

- Yes, I used cancer-hallmark-cnn/data/doc-10-class.
- I finished checking and it looks good!
- Done.
- I was thinking that, because of the number of append operations, using a pandas DataFrame would be very expensive, so the only option for fast access and appending is a dictionary. Please correct me if I'm wrong; I'm not sure about this.
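Roughly, the pattern I mean is the one sketched below: append to plain Python lists inside a dictionary (cheap per append) and only build the table once at the end. This is a simplified stand-in, not the code in the repository; the column names, the `;` label separator, and the helper functions are assumptions, and it writes the CSV with the standard library so it runs without pandas.

```python
import csv
import io


def build_columns(records):
    """Accumulate rows into a dict of lists; list.append is amortized O(1)."""
    data_dict = {"text": [], "labels": []}
    for text, labels in records:
        data_dict["text"].append(text)
        data_dict["labels"].append(";".join(labels))
    return data_dict


def write_csv(data_dict, handle):
    """Write the accumulated columns as CSV rows in a single pass."""
    writer = csv.writer(handle)
    writer.writerow(data_dict.keys())
    writer.writerows(zip(*data_dict.values()))


# Made-up records of (text, labels), for illustration only.
records = [("some abstract", ["label_a", "label_b"]), ("another abstract", [])]
data_dict = build_columns(records)

buffer = io.StringIO()
write_csv(data_dict, buffer)
print(buffer.getvalue())
```

With pandas installed, the same dictionary can be passed to `pandas.DataFrame(data_dict)` once at the end instead of growing a DataFrame row by row. In this sketch `data_dict` is created inside the function and returned, so it does not need to live at module level.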

shaoxiongji commented 3 years ago

On the last bullet point, you're right, but that is not what my question was about. Anyway, it works. Please proceed to the modeling part.