pjedge / longshot

diploid SNV caller for error-prone reads
MIT License
190 stars 26 forks source link

Longshot on Iso-seq data #106

Open camelest opened 2 months ago

camelest commented 2 months ago

Hi, thank you so much for the great tool.

I'm trying to apply longshot on long-read RNA-seq data (Iso-seq data from PacBio HiFi reads). I ran the pipeline with default options and got ~30% reads assigned with HP tags (Is is true to assume HP:i:1 represents reads with REF SNPs and HP:i:2 with ALT SNPs?) Do you have any recommendations to recover more reads?

I was wondering whether I should loosen

-I, --max_cigar_indel <int>                Throw away a read-variant during allelotyping if there is a CIGAR indel
                                               (I/D/N) longer than this amount in its window. [default: 20]

since the RNA-seq reads CIGAR skips large introns.

Thank you so much for your kind help.

vibansal commented 2 months ago

One way to avoid excluding these reads would be to modify the bam/cram file to split the reads containing long 'N' cigar operations into multiple reads. See for example:

https://gatk.broadinstitute.org/hc/en-us/articles/360036858811-SplitNCigarReads

camelest commented 1 month ago

@vibansal Thank you so much for your reply. Sorry for my late reply, I somehow did not receive the notice.

I tried with SplitNCigarReads. For some regions, even though the reads look pretty nice as a SNP, it got filtered out at the final step. (included in 4.0.final_genotypes.vcf but not in the final result) Does it have something to do with other settings?

The output in 4.0.final_genotypes.vcf is

chrX    103482802       .       T       G       1       dp      DP=9850;AM=0;MC=0;MF=0.000;MB=0.000;AQ=25.62;GM=0;PH=0.00,37.78,37.78,37.78;SC=None;    GT:GQ:DP:AD:PS:UG:UQ    0/0:34:9850:5355,4495:.:0/1:127.63
vibansal commented 3 weeks ago

Sorry for the late reply. The variant is filtered since it has very high depth (9850). The "-max_cov" option can be used to increase the threshold.

camelest commented 3 weeks ago

Thank you so much for the reply. I tried with --max_cov 100000 and now the dp changed to PASS but I still don't find it in the final output. Does the final quality filtering occur at this step?

(in 4.0.final_genotypes.vcf)

chrX    103482802       .       T       G       1       PASS    DP=9850;AM=0;MC=0;MF=0.000;MB=0.000;AQ=25.62;GM=0;PH=0.00,37.78,37.78,37.78;SC=None;    GT:GQ:DP:AD:PS:UG:UQ    0/0:34:9850:5355,4495:.:0/1:127.63
vibansal commented 5 days ago

The genotype of the variant is 0/0 and the final output contains only variants with non-ref genotypes.