lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/
Other
478 stars 132 forks source link

Sam2Tsv --regions option not working #182

Closed MiHern closed 3 years ago

MiHern commented 3 years ago

Subject of the issue

Sam2Tsv --regions option not working

Hi,

I'm trying to use Sam2Tsv to extract the base of certain positions that have been called as polymorphic when variant calling from each read of a BAM file. Therefore, this tool is ideal by using the --regions option with the .vcf as input for that option. However, I have tried and it outputs me a huge file with all the positions from all the reads of the BAM file. I have tried by using a .bed file instead and I have the same. It is not filtering as if this option is not working or maybe I'm not formatting the files correctly.

Any idea on how to solve this issue?

lindenb commented 3 years ago

Hi, I'llhave a look at this. A straightforward solution is to pipe the output of samtools into sam2tsv.

samtools view -M -L in.bed -O BAM in.bam | java -jar dist/sam2tsv.jar (...)
MiHern commented 3 years ago

Thanks for the quick answer.

With that samtools line it is filtering out the reads that do not have overlaps with the positions in the bed file, but it is not filtering out all the positions that are not in the bed in each read, no?

Let me explain in a bit more detail. In my case I have data from amplicon seq that is 1000bp (i.e.). I have mapped it and I have called variants (SNPs). So from the amplicon I would like to extract positions (let's say) 3, 10, 40, 345 and 850, which were detected as polymorphic (that in chromosomal coordinates, which is what remains in the bam and vcf/bed as they are mapped, are whatever millions). At the end I would like to have a file with a list of those particular 5 bases of every read, dumping all the positions in between (and of course, if there are, all the reads without those positions).

lindenb commented 3 years ago

With that samtools line it is filtering out the reads that do not have overlaps with the positions in the bed file, but it is not filtering out all the positions that are not in the bed in each read, no?

sorry, I don't get that sentence. If the BED file overlap any part of a read you'll get the whole read.

MiHern commented 3 years ago

Yes, but I don't think I need that. It is interesting to know, though, it might be useful for some other data. This is amplicon seq, so almost all of my reads cover the positions in the bed file. In fact, I used this same BAM file to call variants, so the SNPs found must be included in almost all of the reads. I don't need to filter out the reads that don't contain the positions I'd like to extract.

This is what the output of my Sam2Tsv call look like:

Screenshot 2021-07-27 at 11 12 22

If you check the READ-POS0 column, it is outputting all the positions of the reads, while what I'd like is to output only the ones with the yellow box, that are the ones in the vcf/bed file. But even doing --regions it doesn't do it. And with the samtools piping I get the same.

lindenb commented 3 years ago

oh, I think I understand. The --regions option is used to select input reads but there is no option to filter the reads on output. You could just filter this output with awk. Something like awk '$4=="Chr1" && $8=5187094'

MiHern commented 3 years ago

Ah ok, I understand now. I'll try that.

Thank you for the help!