How to generate the correct input file for DeepMicro

minoh0201 / DeepMicro

Deep representation learning for disease prediction based on microbiome data

MIT License


How to generate the correct input file for DeepMicro #1

Open oschakoory opened 3 years ago

oschakoory commented 3 years ago

Hi, I would like to use DeepMicro for disease prediction.

But I can't figure out how to generate the correct input data for DeepMicro. UserDataExample.csv contains many numeric values, where each row represents a sample and each column represents a microbe, but how did you produce that table?

The datasets I have are: (i) paired-end FASTQ files; (ii) 16S sequences (in FASTA format) reconstructed from the paired-end FASTQ files; (iii) a taxonomy file plus the abundance of each microorganism (in CSV format).

How do I convert this information into a table similar to UserDataExample.csv?

Thank you for your help

minoh0201 commented 3 years ago

You may want to write your own script (e.g. in Python or R) to convert your data files into the desired format. You should also consider a normalization method suited to your needs. Here are some relevant articles: 1) https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4637-6 2) https://www.biostars.org/p/392811/

You can also consider using taxonomic profilers such as MetaPhlAn 3 (https://huttenhower.sph.harvard.edu/metaphlan/) and mOTUs2 (https://motu-tool.org/).
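
As a rough illustration of the kind of conversion script suggested above, here is a minimal sketch using only the Python standard library. It assumes the abundance data can be read as (sample, taxon, count) records; the sample names, taxa and counts are hypothetical, and total-sum scaling is only one of several possible normalizations. It also assumes the target is a plain comma-separated numeric matrix with one row per sample and one column per microbe, as described for UserDataExample.csv; whether a header row or sample IDs are expected should be checked against that file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-format records: (sample, taxon, count).
# Real data would be read from your own taxonomy/abundance CSV.
records = [
    ("S1", "Bacteroides", 30),
    ("S1", "Prevotella", 10),
    ("S2", "Bacteroides", 5),
    ("S2", "Roseburia", 15),
    ("S3", "Prevotella", 8),
]

# Sum counts per sample and taxon.
counts = defaultdict(dict)
for sample, taxon, count in records:
    counts[sample][taxon] = counts[sample].get(taxon, 0) + count

# Fix a consistent row (sample) and column (taxon) order.
samples = sorted(counts)
taxa = sorted({taxon for _, taxon, _ in records})

# Build the sample x taxon matrix; taxa missing from a sample become 0.
# Each row is then divided by its total (total-sum scaling, an assumption).
table = []
for sample in samples:
    row = [counts[sample].get(taxon, 0) for taxon in taxa]
    total = sum(row)
    table.append([value / total if total else 0.0 for value in row])

# Serialize as comma-separated values without header or sample IDs.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(table)
csv_text = buffer.getvalue()
print(csv_text)
```

To produce a file instead of a string, the same `csv.writer` call can be pointed at a file opened with `open(path, "w", newline="")`. Keep the `samples` list around, since the label file has to follow the same row order.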

oschakoory commented 3 years ago

Thank you for your reply. I have another question, about the label file. What is the binary value, 0 (absent) or 1 (present), based on? If I have 5 controls and taxon 1 is present in 3 of the 5, how do I determine whether it is a 0 or a 1?

Thank you for your help. I would like to mention that DeepMicro is a very good algorithm and simpler to understand than the other DL methods I have used so far. Keep up the excellent work!

minoh0201 commented 3 years ago

Presence (1) or absence (0) of a certain strain is an independent observation for each sample. It does not depend on any other sample; it solely indicates whether that strain was found in the sample in question. Please refer to MetaPhlAn2 strain-level profiling.
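
To make the per-sample nature of these values concrete, here is a minimal sketch that derives a presence/absence matrix from an abundance matrix. The abundance values are made up, and the cutoff of "greater than zero" is an assumption for illustration; a strain-level profiler applies its own detection criteria.

```python
# Hypothetical abundance matrix: rows are samples, columns are strains.
abundance = [
    [0.75, 0.25, 0.0],  # sample 1
    [0.0, 0.0, 1.0],    # sample 2
    [0.1, 0.0, 0.9],    # sample 3
]

# Each cell is decided on its own: 1 if the strain was observed in that
# sample, 0 otherwise. No other sample is consulted.
presence = [[1 if value > 0 else 0 for value in row] for row in abundance]

for row in presence:
    print(row)
```

In this toy matrix the first strain is present in two of the three samples, and each of those samples simply carries its own 1 or 0 for it.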

oschakoory commented 3 years ago

I trained DeepMicro (SVM) with 48 samples (control + diseased) as UserDataExample.csv. Then I (purposely) used the data from one of the diseased samples for prediction, as LabelDataExample.csv:

python DM.py -r 1 -cd UserDataExample.csv -cl LabelDataExample.csv -m svm

I got this output:

Accuracy metrics
AUC, ACC, Recall, Precision, F1_score, time-end, runtime(sec), classfication time(sec), best hyper-parameter
[0.8565, 0.8929, 0.4, 1.0, 0.5714, '2021-07-19 09:26:12.410477', 0.35, 0.35, "{'C': 32, 'gamma': 0.00048828125, 'kernel': 'rbf'}"]
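
As a quick sanity check on how these numbers relate, the sketch below recomputes the F1 score from the reported recall and precision. It assumes F1_score is the usual harmonic mean of precision and recall; the thread does not state how DeepMicro computes it.

```python
# Values copied from the output above.
recall = 0.4
precision = 1.0

# Harmonic mean of precision and recall (standard F1 definition, assumed).
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.5714, matching the reported F1_score
```

Under the standard definitions, a precision of 1.0 with a recall of 0.4 means that every sample predicted as positive was truly positive, while only 40% of the truly positive samples were detected.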

Can you help me identify the parameter(s) I need to use to predict whether LabelDataExample.csv is more likely a control or a diseased patient?

Thank you for your valuable help.