Suggestion to include "missing data" codes

capoony commented 1 year ago

Dear Lucas,

I have a suggestion how to handle missing information. In our recent papers (e.g. here and here), we extended the original SYNC file code by a missing data code. I.e., instead of using 0:0:0:0:0:0, which may either indicate missing data (e.g. by not passing SNP calling criteria) or zero coverage, we introduced the .:.:.:.:.:. code, which unambiguously indicates that this position is masked.

It would be very useful if your software could handle this as well.

Thanks a lot,

Martin

lczech commented 1 year ago

Hi Martin,

thank you for the suggestion! I'm also seeing a new format gSYNC in the second paper that you linked - is that yet another variant of the sync format, or simply a sync file with all positions (even the invariant ones) present?

As for your suggestion, for clarification: Do you simply want the sync parser in grenedalf to also be able to read this format? In that case, consider it done. Or would you want to treat missing data of that format also different from a zero-coverage position? Because at least for the statistics that we compute in grenedalf, zero-coverage positions are ignored anyway.

Lastly, it seems that yet another ad-hoc bioinformatics file format has reached its limit, as new demands arose. A few month ago, I've had a short email exchange with the group of Alex Fournier-Level (to get their attention, mentioning them here: @aflevel and @jeffersonfparil). They also came up with a custom sync extension that has an additional count for insertions. @MoisesExpositoAlonso and I have been discussion a proper file format for pool-seq data for a while now, that would be more flexible that sync, and (hopefully) cover more use cases, such as the ones you all have, and additionally offer to use allele frequencies instead of counts, and other shenanigans. Maybe it's time to get together and discuss this with a few interested people at some point, before we have 100 variants for the format :-)

Cheers and so long Lucas

capoony commented 1 year ago

Hi Lucas,

thanks for your swift and extensive response! Yes, the gSYNC is basically a file with all sites included (similar to the gVCF from GATK). The idea was to facilitate a simple extension of the gSYNC with new data given the same dimensionality.

We introduced the .:.:.:.:.:. field to distinguish masked from zero-coverage positions.

I am totally in for disucssing standardization of formats, this would be super useful to keep conistency and to implemet improvements. 👍

All the very best Martin

lczech commented 1 year ago

Hi Martin,

thanks, I still have some questions:

By "dimensionality", you mean number of positions per chromosome? Or is there more to it?

And for the masked field: How is it treated different from a zero-coverage position in your analyses? Would you also want different treatment for the statistics that grenedalf currently computes? As said, a zero-counts entry would currently also be ignored, so apart from being able to parse it (which I have now implemented), it does not seem that I'd need to change anything else, right?

Cheers Lucas

capoony commented 1 year ago

Hi Lucas,

thanks a lot!

Hi Martin,

thanks, I still have some questions:

By "dimensionality", you mean number of positions per chromosome? Or is there more to it?

Exactly, it only means that it has as many positions as in the Reference genome.

And for the masked field: How is it treated different from a zero-coverage position in your analyses? Would you also want different treatment for the statistics that grenedalf currently computes? As said, a zero-counts entry would currently also be ignored, so apart from being able to parse it (which I have now implemented), it does not seem that I'd need to change anything else, right?

Perfect! Right now, we do not make use of this entry either, but we wanted a clean separation between masked and missing data.

Cheers Lucas

Best, Martin

lczech commented 1 year ago

Hi Martin,

thanks for the clarifications!

One more thought: the original sync files produced by PoPoolation2 already contain every position where there is data. So, a gsync then simply also adds those positions with no coverage in the data? And I assume in your implementation, for these positions, you use your missing data code, right?

I've seen that you used a custom script (as so often) to produce these files in your paper. If you think that this would be useful to others, it's easy for me to implement a command to produce these files in grenedalf as well.

What do you think? Lucas

lczech commented 1 year ago

Hi @capoony and others,

the current version of grenedalf on the master branch can now read this format, and it will be part of the next release version 0.2.0, once I've tested the code a bit more.

Still, the whole discussion about formats should not be forgotten. I think we should continue via email though, including @aflevel and @jeffersonfparil, @MoisesExpositoAlonso, and maybe people from the Petrov lab at Stanford as well, and have a meeting at some point. I'm currently wrapping up a couple of projects, but after that might be a good time.

Cheers and so long Lucas

jeffersonfparil commented 1 year ago

Hi @lczech ,

Thanks for including me in the discussion. I agree, we should definitely start a discussion over formats. I've actually scrapped our syncx format with insertions and opted to just add the capacity for meta data with '#' on top of the file to label the pools.

lczech commented 1 year ago

Hi @jeffersonfparil,

thanks for following up on this! Ah, that's interesting. Can you send me an example of that header? I had a similar extension in mind (as a hot-fix until we have figured out a proper format), but if you already have one, I could use the same :-)

Cheers Lucas

jeffersonfparil commented 1 year ago

Hi @lczech, I just use this for now:

#chr    pos ref UG-pool1.bam    UG-pool2.bam    UG-pool3.bam    UG-pool4.bam    UG-pool5.bam
Chromosome1 456527  C   0:0:4:0:0:0 0:1:2:0:0:0 0:2:4:0:0:0 0:1:4:0:0:0 0:1:6:0:0:0
Chromosome1 1041321 T   0:20:0:1:0:0    0:12:0:0:0:0    0:15:0:1:0:0    0:22:0:1:0:0    0:26:0:1:0:0
Chromosome1 1131021 C   0:1:1:0:0:0 0:1:1:0:0:0 0:0:1:0:0:0 0:1:5:0:0:0 0:1:0:0:0:0
Chromosome1 1133215 C   0:0:5:0:0:0 1:0:2:0:0:0 0:0:12:0:0:0    0:0:8:0:0:0 0:0:13:0:0:0
Chromosome1 1133226 T   0:5:0:0:0:0 0:3:0:0:0:0 0:14:0:0:0:0    1:9:0:0:0:0 0:12:0:0:0:0
Chromosome1 1133227 G   0:0:0:4:1:0 0:0:0:1:2:0 0:0:0:10:4:0    0:0:0:3:7:0 0:0:0:8:5:0

lczech commented 1 year ago

Perfect, that's exactly what I had in mind :-) I hope I'll get to implement that soon!

lczech commented 1 year ago

Hey @capoony and @jeffersonfparil,

implemented the ad-hoc sync header format as suggested above now, so that grenedalf can read and write files with that header. It was definitely annoying for me to externally having to keep track of which column came from which sample, so I hope this makes it easier ;-)

Cheers Lucas

lczech commented 5 months ago

Hi @capoony,

just wanted to give an update here, despite the issue being closed already. The latest release grenedalf v0.5.0 now finally also offers to use mask files and proper normalization of values using the high-quality positions, hence taking missing data and/or the mask file into account.

Just wanted to mention this, in case it's relevant for you :-)

Cheers and so long Lucas

lczech / grenedalf

Suggestion to include "missing data" codes #4