mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0

missing data is ignored with variant files but not .Rtab #157

Open julibeg opened 3 years ago

julibeg commented 3 years ago

Missing genotypes in variant files are ignored: https://github.com/mgalardini/pyseer/blob/2e27979568ee34f02d000ca3011002b9d399fb38/pyseer/input.py#L485-L486

However, in .Rtab files they are treated as missing data, and the fit fails later on: https://github.com/mgalardini/pyseer/blob/2e27979568ee34f02d000ca3011002b9d399fb38/pyseer/input.py#L423-L424

Is this intended? For now I have replaced `d[sample] = np.nan` with `continue`, so that genes with a few missing entries still get a fit.
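To illustrate the difference between the two behaviours, here is a minimal standalone sketch. It is not pyseer's actual code: the function name `parse_presence_row`, its parameters, and the `skip_missing` flag are hypothetical, and `math.nan` stands in for `np.nan` so the example needs no third-party imports. It assumes that a missing entry is any field that cannot be parsed as an integer.

```python
import math


def parse_presence_row(samples, fields, skip_missing=True):
    """Map sample names to 0/1 presence values for one Rtab-style row.

    Hypothetical helper for illustration only, not part of pyseer.
    Entries that cannot be parsed as an integer count as missing.
    """
    d = {}
    for sample, value in zip(samples, fields):
        try:
            d[sample] = int(value)
        except ValueError:
            if skip_missing:
                # Proposed behaviour: drop the sample for this row,
                # as is done for missing genotypes in variant files.
                continue
            # Behaviour described in the issue: keep the sample with a
            # NaN, which a downstream model fit may not tolerate.
            d[sample] = math.nan
    return d


samples = ["s1", "s2", "s3"]
fields = ["1", "", "0"]  # s2 has a missing entry

skipped = parse_presence_row(samples, fields, skip_missing=True)
kept = parse_presence_row(samples, fields, skip_missing=False)
```

With `skip_missing=True` the row only contains the samples that have a value, so a fit can proceed on those samples. With `skip_missing=False` the missing sample stays in the dictionary as NaN.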

mgalardini commented 3 years ago

I think this is because we do not really expect missing values in an .Rtab file, whereas they can be quite common in VCF files. I think we could implement your proposed change to be more consistent.

Just out of curiosity, did the .Rtab file you were using come from panaroo/roary? If so, I was not aware that it could contain missing values.

julibeg commented 3 years ago

Makes sense.

No, it was a custom .Rtab file.

mgalardini commented 3 years ago

OK, that makes sense. If you would like to open a PR, we could merge this change. If you know how to add unit tests, that would also be great. If not, I can do it once the change is merged.

julibeg commented 3 years ago

Will do.