iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
110 stars 14 forks source link

option to print out reads that do not map #214

Open iqbal-lab opened 4 years ago

iqbal-lab commented 4 years ago

would it be possible and would it be hard to add an option to print to file any reads which do not map to the PRG?

iqbal-lab commented 4 years ago

This is trivial for illumina as it is easy to make it binary - either all/most of the read maps or not. But for nanopore it's less obvious . if a long read covers two genes but has a big unmapped region in the middle, then you might want that unmapped bit in the middle, or the whole read.

mbhall88 commented 4 years ago

The best solution here could be an option to create a file with mapping coordinates for each read. I think we already print this in the debug log, so it would just be a matter of "formalising" it to go to file instead.

leoisl commented 4 years ago

Yeah, we could also do what error correctors do when correcting a long read: split the long read into several shorter, well corrected regions if they don't manage to correct the whole read. In this case, we would split a long read into its unmapped regions, if it happens to have mapped and unmapped regions.

I don't think it is a nice option to write the whole long read as unmapped if a part of it does not map, as PRGs are mostly biased towards genic regions, and long reads spanning intergenic regions will be more likely to be consider unmapped. But again, I have experience in eukaryotes, where we always have to think about intergenic regions, not sure this assumption is still valid for bacteria.

iqbal-lab commented 4 years ago

Well the U shaped frequency distribution for genes means there are always going to be new rare genes in a new dataset which are not in the PRG, so we expect unmapped chunks in the middle

mbhall88 commented 4 years ago

I'm not saying we print out something that says mapped or unmapped. If we provide the read and its mapping coordinates (a line for each set of coordinates - so a read ID could occur multiple times) then it is up to the user to decide what to do with this. I.e. they could look for unmapped chunks, or just for reads that have no mapping coordinates.

That sounds a bit more flexible to me? Or am I missing something here?

mbhall88 commented 4 years ago

I guess we're converging on GAM (VGs version of BAM that I can't find any specs for)

leoisl commented 4 years ago

Yeah, that is indeed more flexible. From a user POV, I would prefer that this info would be given to me in a standard file format, where can be fed into several other tools, and I don't need to pre-process it. I think fasta/q or SAM/BAM format would be great. SAM/BAM is still hard to do in pandora I think, we need to go from minimizer alignment to base-to-base alignment, so I don't think it is feasible now (but indeed possible in the future).

However, fasta/q format is feasible. I was thinking about two files, one for mapped and the other for unmapped chunks. Header would be read id followed by the mapping coordinates, and the sequence would correspond to that chunk of the read. Obviously, the user would know if the chunk was mapped or unmapped based on the file the chunk is. We could also output just one fasta file, with a tag in the header (mapped or unmapped)

Edit: sorry, I wrote this while you were writing about GAM. GAM still needs base-to-base alignment, I suppose?

iqbal-lab commented 4 years ago
  1. if we're outputting just the unmapped chunks, we can simply use tabix?
  2. outputting alignments of the reads to the graphs (GAM) is a wholly different thing, not from this enhancement request
  3. ah yes, i guess one file could cover unmapped chunks and mapped chunks showing which gene they map to.