Open julibeg opened 3 years ago
Thanks for the PR; maybe making a new Rtab file with missing values and making sure those are handled correctly in an ad-hoc unit test? Happy to prepare it if you can prepare a simple example file.
Sorry for the delay! I added a simple unit test now.
When looking at the code I realised that missing values are always replaced by 0
s regardless of file type (as far as I can tell). Is this intended? Otherwise, AF could be used or the majority allele.
Thanks, I'll review the changes shortly and merge.
I checked again and we do ignore the samples with missing values (see here: https://github.com/mgalardini/pyseer/blob/1359b614491bca96264d9bd8e6535ec90a3cbedf/pyseer/input.py#L427).
I think we may want to assume that an empty "cell" also means that the gene has missing information on its status, other than the .
char. Will try to add that change too
Ah wait I think we might have had a misunderstanding (my fault really!). How we handle missing variants in VCF files is a bit convoluted because we have to handle corner cases due to burden testing. We do assign a missing value if a variant is missing from the VCF file (https://github.com/mgalardini/pyseer/blob/1359b614491bca96264d9bd8e6535ec90a3cbedf/pyseer/input.py#L479).
So I think we should not merge this, unless it is not working as intended as it is (i.e. the --max-missing
option is working as expected with Rtab files).
Sorry for not thinking this through earlier!
I see. I can't say that I understand all the edge cases and how they are handled in the code. However, there is also https://github.com/mgalardini/pyseer/blob/1359b614491bca96264d9bd8e6535ec90a3cbedf/pyseer/input.py#L486
which removes samples from the dict if the second haplotype is also '.'
(which should always be the case in bacterial VCFs). The change in cfed703 should do the same thing for .Rtab files, but I don't know how this relates to burden testing etc.
Missing data is difficult, and in my opinion is something that's more of a user consideration. But it is expected in many cases. I think the desired pyseer behaviour would be:
Although I'm open to suggestions!
Using del d[sample]
sets to 0, which is the simplest thing we could do. I think we should confirm whether np.nan works as intended, and if so use that with both Rtab and VCF.
If not, we should do a simple fill. plink has a similar option for missing data e.g. use the reference, use the major allele, use the minor allele. In our case I think we just want use major allele or use reference (0). We could either just pick one of these, or we could add as a command line option.
Regarding unit testing: After the change, all tests passed. Should we / I write new tests or rather change the
tests/presence_absence.Rtab.gz
file to include some missing values and adapt the tests accordingly (or leave testing as is since the change won't break things that worked before anyway)?