Hello I want to make a new model

salbrec / seqQscorer

MIT License

Hello I want to make a new model #3

Closed kyungkeonchung closed 3 years ago

kyungkeonchung commented 3 years ago

Hello, I want to make a new classification model. Is there a way for me to use the training data you used? I think there's a limit to how much data I can collect and preprocess myself.

I would appreciate it if you could provide me with a guide for creating and training new models.

salbrec commented 3 years ago

Hello!

Sure, you can also use the datasets that were used by seqQscorer to train new models.

You’ll find them under seqQscorer/utils/datasets, all as tables in TSV format. The file “generic_generic_generic.tsv”, for instance, contains the full dataset of all 2072 samples for all assays, species and run-types, including the columns that describe which assay, species or run-type a sample belongs to. Several subsets, specific to certain selections, are also available in the same folder. The tables contain all the features, so all these samples are fully preprocessed.

The last column always describes the quality status. Note that “1” means the sample was assigned the status “revoked”, hence low quality, while “0” indicates a good-quality sample.
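
As a rough illustration of the layout described above, here is a minimal sketch that reads one of the TSV tables and separates the last column (quality status) from the rest. It assumes the table has a single header row and uses only the Python standard library; the function names `load_tsv` and `class_balance` are hypothetical helpers and not part of seqQscorer.

```python
import csv
from collections import Counter


def load_tsv(path):
    """Read a tab-separated table and split off the last column as the label.

    Assumption: the first row is a header. Feature values are returned as
    strings, because some columns (assay, species, run-type) may not be numeric.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        rows = [row for row in reader if row]  # skip empty lines

    feature_names = header[:-1]
    features = [row[:-1] for row in rows]
    # Last column: 1 = "revoked" (low quality), 0 = good quality.
    labels = [int(float(row[-1])) for row in rows]
    return feature_names, features, labels


def class_balance(labels):
    """Count how many samples carry each quality status."""
    return Counter(labels)
```

Calling `load_tsv("seqQscorer/utils/datasets/generic_generic_generic.tsv")` from the directory that contains the repository should return the column names, the feature rows and the 0/1 labels. Checking `class_balance(labels)` before training is a quick way to see how imbalanced the two classes are.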

kyungkeonchung commented 3 years ago

Thanks for your reply. It was the exam period, so I checked the reply a little late. I thought I'd have to preprocess all the data myself, so I'm happy to know that I don't.

I saw that there are many datasets in seqQscorer/utils/datasets. I wonder which data you used for training and testing. I would like to make a model using the same data for comparison with the paper results. I'd like to try a number of deep learning models, but I'm worried that the data is too scarce, as the paper says.

Thank you very much for your sincere response.

Muedi commented 3 years ago

Hi,

Most of the results in the main paper were generated with the generic model (trained on the data in generic_generic_generic.tsv). So if you use this file, your results should be similar. Small differences can of course always occur with machine learning techniques.

We also trained and tested on the other available subsets; the results are available in the corresponding paper and its supplementary material.

If you build a deep learning model on the datasets, feel free to share your results with us; it would be very interesting. We used a fully connected feed-forward MLP, which performed quite well but was often outperformed by the tree-based ensemble methods (though not by much).
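
When comparing a new deep learning model against other classifiers on a dataset of this size, it may help to evaluate every model on the same class-balanced folds with a fixed seed, so that differences come from the models rather than from the split. The following is a minimal sketch of stratified k-fold splitting using only the standard library. It is not the evaluation protocol used in the paper, and `stratified_folds` and `train_test_splits` are hypothetical helper names.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Distribute sample indices over n_folds folds with similar class ratios.

    A fixed seed makes the assignment reproducible between runs.
    """
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out each fold once."""
    for held_out, test_indices in enumerate(folds):
        train_indices = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        ]
        yield sorted(train_indices), sorted(test_indices)
```

With the 0/1 quality labels from one of the TSV tables, `stratified_folds(labels, n_folds=5, seed=0)` gives five folds, and `train_test_splits(folds)` can then drive the training loop for each model under comparison.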