tyjo / coptr

Accurate and robust inference of microbial growth dynamics from metagenomic sequencing
GNU General Public License v3.0

How should missing values in the output be treated? #11

Closed brendanwee closed 2 years ago

brendanwee commented 2 years ago

In the final output of my experiment, I get a decently populated matrix, which I want to feed into a machine learning model. Looking through the paper, I didn't see any mention of how to handle missing values (unless I missed it). Assuming they cannot be ignored, how would you treat them?

tyjo commented 2 years ago

That's a good question. The analyses in the paper only compared non-missing PTRs within the same species.

Missing values are challenging to deal with because they could be missing for two reasons: 1) The species is not in the sample, or 2) The species is below the detection threshold.

If you could distinguish the two types of missing values, you could try to impute the type 2 values (species present but below the detection threshold). However, it's not clear to me how to set missing values for species that are not in a sample.
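One rough way to separate the two cases, sketched here under assumptions: suppose you have, alongside the PTR matrix, a per-species, per-sample count of mapped reads. The `read_counts` table, the `classify_missing` helper, and the label names are all hypothetical and are not part of CoPTR. Entries with a missing PTR but some mapped reads are treated as "below threshold"; entries with no mapped reads are only candidates for "not in the sample", since a low-abundance species can be present and still go unsampled.

```python
import math

def classify_missing(ptr, read_counts):
    """Label each (species, sample) entry of a PTR table.

    ptr         : dict of species -> dict of sample -> PTR estimate (NaN if missing)
    read_counts : dict of species -> dict of sample -> mapped read count
                  (hypothetical input; not produced by this sketch)
    """
    labels = {}
    for species, row in ptr.items():
        labels[species] = {}
        for sample, value in row.items():
            if not math.isnan(value):
                labels[species][sample] = "observed"
            elif read_counts[species][sample] == 0:
                # No reads mapped: possibly absent, possibly just unsampled.
                labels[species][sample] = "no_reads"
            else:
                # Reads mapped but no estimate: below the detection threshold.
                labels[species][sample] = "below_threshold"
    return labels

# Toy data, made up for illustration only.
ptr = {"species_A": {"s1": 0.8, "s2": math.nan, "s3": math.nan}}
read_counts = {"species_A": {"s1": 12000, "s2": 0, "s3": 150}}

labels = classify_missing(ptr, read_counts)
print(labels)
```

Only the `below_threshold` entries would be candidates for imputation under this scheme; what to do with `no_reads` entries remains an open choice.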

If your model starts by taking a linear combination of an input vector, one possibility is to set missing values to 0. I am not sure how well this will work in practice.
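To make the zero-fill idea concrete, here is a minimal sketch in plain Python. The helper names (`zero_fill`, `linear_score`), the weights, and the feature values are made up for illustration. It shows only that a missing entry set to 0 drops out of the linear combination, so the score depends on the observed species alone.

```python
import math

def zero_fill(features):
    """Replace missing entries (NaN) with 0.0."""
    return [0.0 if math.isnan(x) else x for x in features]

def linear_score(weights, features, bias=0.0):
    """First layer of a linear model: bias + sum of w_i * x_i."""
    return bias + sum(w * x for w, x in zip(weights, features))

# One sample's feature vector: three species, the second one missing.
weights = [0.5, -1.0, 2.0]
sample = [0.8, math.nan, 0.3]

filled = zero_fill(sample)
score = linear_score(weights, filled)

# Same score as a model that only ever saw the two observed species.
observed_only = linear_score([0.5, 2.0], [0.8, 0.3])
print(filled, score, observed_only)
```

One caveat on the design choice: if the entries are on a log2 scale, 0 is also the value of a PTR of exactly 1, so after zero-filling the model cannot tell a missing entry from a real measurement of that value.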

brendanwee commented 2 years ago

Hmm... If you lowered CoPTR's threshold parameters all the way (e.g. min_reads, mincov), would it predict a PTR for every species with a mapped read? I wonder if it would be useful for CoPTR (when given a parameter) to output the full matrix of detected species, including the species and samples where it was unable to predict a PTR.

tyjo commented 2 years ago

Even if you lowered the threshold, you likely won't get a full matrix. There could be low abundance species in a sample without any sequencing reads.

brendanwee commented 2 years ago

yeah... Thanks for the ideas. I'll let you know if we come up with something useful!