Open iqbal-lab opened 4 years ago
This is trivial for illumina as it is easy to make it binary - either all/most of the read maps or not. But for nanopore it's less obvious . if a long read covers two genes but has a big unmapped region in the middle, then you might want that unmapped bit in the middle, or the whole read.
The best solution here could be an option to create a file with mapping coordinates for each read. I think we already print this in the debug log, so it would just be a matter of "formalising" it to go to file instead.
Yeah, we could also do what error correctors do when correcting a long read: split the long read into several shorter, well corrected regions if they don't manage to correct the whole read. In this case, we would split a long read into its unmapped regions, if it happens to have mapped and unmapped regions.
I don't think it is a nice option to write the whole long read as unmapped if a part of it does not map, as PRGs are mostly biased towards genic regions, and long reads spanning intergenic regions will be more likely to be consider unmapped. But again, I have experience in eukaryotes, where we always have to think about intergenic regions, not sure this assumption is still valid for bacteria.
Well the U shaped frequency distribution for genes means there are always going to be new rare genes in a new dataset which are not in the PRG, so we expect unmapped chunks in the middle
I'm not saying we print out something that says mapped or unmapped. If we provide the read and its mapping coordinates (a line for each set of coordinates - so a read ID could occur multiple times) then it is up to the user to decide what to do with this. I.e. they could look for unmapped chunks, or just for reads that have no mapping coordinates.
That sounds a bit more flexible to me? Or am I missing something here?
I guess we're converging on GAM (VGs version of BAM that I can't find any specs for)
Yeah, that is indeed more flexible. From a user POV, I would prefer that this info would be given to me in a standard file format, where can be fed into several other tools, and I don't need to pre-process it. I think fasta/q
or SAM/BAM
format would be great. SAM/BAM
is still hard to do in pandora I think, we need to go from minimizer alignment to base-to-base alignment, so I don't think it is feasible now (but indeed possible in the future).
However, fasta/q
format is feasible. I was thinking about two files, one for mapped and the other for unmapped chunks. Header would be read id followed by the mapping coordinates, and the sequence would correspond to that chunk of the read. Obviously, the user would know if the chunk was mapped or unmapped based on the file the chunk is. We could also output just one fasta file, with a tag in the header (mapped
or unmapped
)
Edit: sorry, I wrote this while you were writing about GAM. GAM still needs base-to-base alignment, I suppose?
would it be possible and would it be hard to add an option to print to file any reads which do not map to the PRG?