Feature request: Compatibility with Starlong

adbeggs commented 1 year ago

Hi

I am trying to do fusion neoantigen prediction using Arriba. However, STAR can't cope with long reads, and ideally STARlong should be used. Do you think this could be supported, please?

Thanks

Andrew

suhrig commented 1 year ago

To some extent, this is already possible. Technically, Arriba is compatible with STARlong. I just gave it a spin with PacBio long reads and it worked.

The mismappers filter took a very long time to run. It should be disabled for long reads (-f mismappers), since there is a very slim chance of a long read being mapped to the wrong locus anyway.

There weren't any false positives. I am not sure about the false negative rate. Looking at the discarded fusions file, I mainly saw false negatives that were discarded due to lack of support (they had only 1 supporting read). So Arriba reported pretty much everything that can be reported with confidence. So if anything was missed, it was mainly because STAR did not find an alignment. I believe STAR should be able to find most fusions with ease given the long read length. It may have trouble with reads spanning multiple breakpoints (=> unmapped, too short), which may be ameliorated through STAR parameter tweaking. However, STAR will only ever report one chimera per read. Multiple chimeric alignments of the same read are supported by neither STAR nor Arriba.

In addition to disabling the mismappers filter, other Arriba parameters may improve the sensitivity some more. For example, one could reduce the minimum number of supporting reads to 1. I need to give this a try.
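As a rough illustration of the settings discussed above, the sketch below assembles an Arriba command line for long-read input with the mismappers filter disabled and the minimum number of supporting reads lowered to 1. The file names are placeholders and the helper name `build_arriba_command` is made up for this example; it only uses the Arriba options `-x`, `-a`, `-g`, `-o`, `-O`, `-f` and `-S`. Whether `-S 1` actually helps on a given data set has not been verified here.

```python
import shlex


def build_arriba_command(bam, assembly, annotation,
                         out="fusions.tsv",
                         discarded="fusions.discarded.tsv",
                         disabled_filters=("mismappers",),
                         min_support=1):
    """Return an Arriba argument list for long reads (illustrative only)."""
    cmd = [
        "arriba",
        "-x", bam,          # alignments produced by STARlong (placeholder path)
        "-a", assembly,     # genome FASTA (placeholder path)
        "-g", annotation,   # gene annotation GTF (placeholder path)
        "-o", out,          # reported fusions
        "-O", discarded,    # discarded fusions
        "-S", str(min_support),
    ]
    if disabled_filters:
        # -f takes a comma-separated list of filters to disable
        cmd += ["-f", ",".join(disabled_filters)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_arriba_command(
        "Aligned.sortedByCoord.out.bam", "genome.fa", "annotation.gtf")))
```

Printing the command rather than executing it keeps the sketch independent of any local installation; the list can be passed to `subprocess.run` once the paths are filled in.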

What made you think that Arriba is not compatible with STARlong? Did it fail on a sample of yours? Or did it miss fusions in a sample?

adbeggs commented 1 year ago

Hi @suhrig - apologies, after my post I managed to make it work fine with STARlong. Sorry for wasting your time!

suhrig commented 1 year ago

No time wasted. I will add an enhancement so that the mismappers filter is skipped for long reads.

If you can share any STAR or Arriba parameter optimizations that improve calling, let me know!

YinYangKarly commented 4 months ago

Hi

I got STARlong and Arriba working on a sample. When I analyzed the Arriba results, I expected to find a CIC-DUX4 fusion, since that fusion was visible when viewing the alignment in IGV, but it was found in neither fusions.tsv nor the fusions.discarded file. I ran Arriba with -f mismappers and -S 1 in addition to all the other required inputs. Is there something I overlooked in the Arriba/STARlong parameters, or is this because CIC-DUX4 is a challenging fusion to detect?

Thanks!

Karlijn

suhrig commented 4 months ago

In order to detect CIC-DUX4 fusions, it is important to enable multimapping chimeric reads. Did you use --chimMultimapNmax 50? If the fusion is not listed in either the fusions.tsv file or the discarded file, then likely it is an alignment issue and STAR failed to find chimeric alignments.
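To make the check described above concrete, here is a minimal sketch that looks for a gene pair in an Arriba output table. It assumes the tab-separated layout whose header line starts with `#gene1` and `gene2`, and that intergenic breakpoints list flanking genes as `GENE(distance)` separated by commas. The function name `find_fusion` is hypothetical, and gene symbols depend on the annotation in use.

```python
import csv


def find_fusion(path, gene_a, gene_b):
    """Return rows of an Arriba table that mention both genes, in either order."""
    hits = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, None)
        if header is None:
            return hits  # empty file
        # The header line begins with "#gene1"; drop the leading "#".
        columns = [name.lstrip("#") for name in header]
        i1, i2 = columns.index("gene1"), columns.index("gene2")
        for row in reader:
            if len(row) <= max(i1, i2):
                continue  # skip malformed or truncated lines
            # Strip the "(distance)" suffix used for intergenic breakpoints.
            genes1 = {g.split("(")[0] for g in row[i1].split(",")}
            genes2 = {g.split("(")[0] for g in row[i2].split(",")}
            forward = gene_a in genes1 and gene_b in genes2
            reverse = gene_b in genes1 and gene_a in genes2
            if forward or reverse:
                hits.append(dict(zip(columns, row)))
    return hits
```

Calling `find_fusion("fusions.tsv", "CIC", "DUX4")` and then the same on the discarded file covers both outputs. If both calls return an empty list, that points towards the aligner not reporting a chimeric alignment rather than towards Arriba's filters.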

Would it be possible to share the BAM file with me or at least the reads mapping to the fusion breakpoints which you see in IGV?

YinYangKarly commented 4 months ago

Thank you for the quick response! Here is the command I used for STARlong:

    STARlong --runThreadN 4 --genomeDir indexes_STAR/ \
        --readFilesIn <longreadFasta> \
        --outSAMtype BAM SortedByCoordinate --outBAMcompression 0 \
        --outFilterMultimapNmax 50 --peOverlapNbasesMin 10 \
        --alignSplicedMateMapLminOverLmate 0.5 \
        --alignSJstitchMismatchNmax 5 -1 5 5 \
        --chimSegmentMin 10 --chimOutType WithinBAM HardClip \
        --chimJunctionOverhangMin 10 --chimScoreDropMax 30 \
        --chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 \
        --chimSegmentReadGapMax 3 --chimMultimapNmax 50 \
        --alignEndsProtrude 5 DiscordantPair \
        --outSAMstrandField intronMotif \
        --outFileNamePrefix <output_dir_and_prefix> \
        --seedPerReadNmax 10000

I used --seedPerReadNmax 10000 because it is a large FASTA file. As for sharing the mapping/BAM file, I need to discuss it with my supervisor since the data is confidential. I will get back to you as soon as possible.

YinYangKarly commented 4 months ago

It took a while longer than expected to get a reply and confirmation from my supervisor. I sent you an email about it, along with a filesurfsender link to the BAM file and its BAI file. I sent it to s.uhrig@dkfz.de, but that address does not exist. Which email address can I send the filesurfsender link to?

suhrig commented 4 months ago

I no longer work at the DKFZ. You can send the link to this address instead: sebastian [dot] uhrig [at] googlemail [dot] com.

YinYangKarly commented 4 months ago

Thank you for your quick response! I sent you the files via filesurfsender and also sent an email. Let me know if you received both.

suhrig commented 4 months ago

@YinYangKarly I had a look at the BAM file. The problem is with STAR, not with Arriba. STAR fails to report a chimeric alignment for the supporting reads. It either aligns the part of the read that maps to CIC or it aligns the part of the read that maps to DUX4, but not both. I am not sure what the reason is. When I map the two parts of the reads individually, STAR finds an alignment. They are unique hits, too. So it has nothing to do with multimapping reads. What's even more strange is the fact that when I cut down the long read to ~60nt on both sides of the fusion junction, then STAR correctly reports a chimeric alignment! The problem must be a limitation of STARlong reporting chimeric alignments of long reads. This is not something I can fix. Maybe the alignment parameters can be tweaked in a way to make this work. I have tried a number of things, but nothing worked. I think it would be best to bring this up with the developer of STAR. He will know much better which parameters need tweaking or if this is a bug that needs fixing. You could send him an example read, such as read rb_E0.L.103631.
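The trimming experiment mentioned above (cutting the read down to about 60 nt on each side of the fusion junction) could be reproduced along these lines. This is only a sketch: it assumes the junction position within the read is already known, for example from inspecting the read in IGV, and the function names are made up for illustration.

```python
def trim_around_junction(sequence, junction, flank=60):
    """Return up to `flank` bases on each side of a junction.

    `junction` is the 0-based index of the first base after the breakpoint.
    """
    if not 0 <= junction <= len(sequence):
        raise ValueError("junction lies outside the read")
    start = max(0, junction - flank)
    end = min(len(sequence), junction + flank)
    return sequence[start:end]


def to_fasta(name, sequence, width=60):
    """Format one record as FASTA text with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

Writing the output of `to_fasta` to a file gives a small test case that can be realigned on its own, which may also be a convenient reproducible example to pass on to the STAR developer.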

YinYangKarly commented 4 months ago

Thank you for looking into it! I will think about your advice and considerations. I have another question, about how Arriba determines confidence with STAR short-read alignments: the CIC-DUX4 fusion we are looking for was found in the discarded fusions file for the short-read alignment. When I disabled the homologs filter, it appeared in the fusions.tsv file. Should I open a new GitHub issue about this, or shall we discuss it by email?

suhrig commented 4 months ago

It would be best to open another issue about this if you don't mind.

suhrig / arriba

Feature request: Compatibility with Starlong #218