ignores missing entries in .Rtab files

julibeg commented 3 years ago

Regarding unit testing: After the change, all tests passed. Should we / I write new tests or rather change the tests/presence_absence.Rtab.gz file to include some missing values and adapt the tests accordingly (or leave testing as is since the change won't break things that worked before anyway)?

mgalardini commented 3 years ago

Thanks for the PR; maybe making a new Rtab file with missing values and making sure those are handled correctly in an ad-hoc unit test? Happy to prepare it if you can prepare a simple example file.

julibeg commented 2 years ago

Sorry for the delay! I added a simple unit test now.

julibeg commented 2 years ago

When looking at the code I realised that missing values are always replaced by 0s regardless of file type (as far as I can tell). Is this intended? Otherwise, AF could be used or the majority allele.

mgalardini commented 2 years ago

Thanks, I'll review the changes shortly and merge.

I checked again and we do ignore the samples with missing values (see here: https://github.com/mgalardini/pyseer/blob/1359b614491bca96264d9bd8e6535ec90a3cbedf/pyseer/input.py#L427).

I think we may want to assume that an empty "cell" also means that the gene has missing information on its status, other than the . char. Will try to add that change too

mgalardini commented 2 years ago

Ah wait I think we might have had a misunderstanding (my fault really!). How we handle missing variants in VCF files is a bit convoluted because we have to handle corner cases due to burden testing. We do assign a missing value if a variant is missing from the VCF file (https://github.com/mgalardini/pyseer/blob/1359b614491bca96264d9bd8e6535ec90a3cbedf/pyseer/input.py#L479).

So I think we should not merge this, unless it is not working as intended as it is (i.e. the --max-missing option is working as expected with Rtab files).

Sorry for not thinking this through earlier!

julibeg commented 2 years ago

I see. I can't say that I understand all the edge cases and how they are handled in the code. However, there is also https://github.com/mgalardini/pyseer/blob/1359b614491bca96264d9bd8e6535ec90a3cbedf/pyseer/input.py#L486 which removes samples from the dict if the second haplotype is also '.' (which should always be the case in bacterial VCFs). The change in cfed703 should do the same thing for .Rtab files, but I don't know how this relates to burden testing etc.

johnlees commented 2 years ago

Missing data is difficult, and in my opinion is something that's more of a user consideration. But it is expected in many cases. I think the desired pyseer behaviour would be:

Exclude samples with a missing genotype from the regression in question (easy enough with the linear/wg models, not sure with the LMM as that's done in blocks). Does setting nan work ok?
With VCF as input missing sites are frequent so dealing with them this way ideally. If not possible, e.g. in the LMM, setting them to the major allele is probably sensible.
With kmers/unitigs missing data = absent so we don't need to deal with this.
Agree that Rtabs should be consistent with VCFs, even though the data we have used up until this point hasn't had missing elements.

Although I'm open to suggestions!

Using del d[sample] sets to 0, which is the simplest thing we could do. I think we should confirm whether np.nan works as intended, and if so use that with both Rtab and VCF.

If not, we should do a simple fill. plink has a similar option for missing data e.g. use the reference, use the major allele, use the minor allele. In our case I think we just want use major allele or use reference (0). We could either just pick one of these, or we could add as a command line option.

mgalardini / pyseer

ignores missing entries in .Rtab files #158