Closed shaoxiongji closed 3 years ago
Is there anything else about it besides being 10 binary classification sets?
There are four subfolders in the data folders. doc-10-class
should be the target one. I don't understand well for others. But oversampled one should be balanced using the over-sampling technique.
But we should just use everything as a multilabel classification problem right? No need to do anything else.
Defining the problem as either multilabel or many binary classification tasks is okay. The multilabel setting uses sigmoid activation to generate logits, which is a form of binary classification to some extent.
I just implemented code for processing the new data and .csv files that I got from it. Please check it out.
It looks good.
\n
in text data_dict
as a global variable? For the last bullet point, you're right about this. But my question is not about this. Anyway, it works. Please proceed to the modeling part.
Use this processed data for HoC classification https://github.com/cambridgeltl/cancer-hallmark-cnn
It converts multi-label classification into 10 binary classification sets.