Arriba error while running STAR outputs

lucascaue commented 4 years ago

Hi! I am trying to use arriba, but wiith some errors:

Loading assembly from 'genome.fna' ERROR: could not find sequence of contig '10'

That was my Arriba command-line: arriba -c S9Chimeric.out.junction -x S9Aligned.sortedByCoord.out.bam -g genannotation.gtf -a genome.fna -f blacklist -o fusions.tsv -O fusions.discarded.tsv-d S9_sorted.vcf -s auto -V 0.1 -T -P -I I was wondering if the problem would be with the chromosomes names into my files, then I tried to to stating them with -i options of contig but with no progress. Do you have any clue what is going on?

Thanks in Advance!

suhrig commented 4 years ago

Upfront some comments about your command:

I'm assuming you try to run Arriba on non-human data. Is that correct? What organism is it, if I may ask? Probably that is why you disable the blacklist and why your files have generic names. Generally speaking that's fine, I'm just curious so that I can better judge the situation. Just beware that you will have more false positives than usual if you don't use a blacklist. Also beware that Arriba only reports fusions if at least one breakpoint is near an annotated exon as specified by the GTF file. You need to disable the filter intronic to have it report fusions with both breakpoints in introns or intergenic regions. This might be relevant in case the gene annotation of your organism is incomplete or in case you expect many breakpoints in introns/intergenic regions.
Second, you should not pass the junction file to Arriba. The parameter -c takes alignments as input, i.e. Chimeric.out.sam. If you only have the junction file, then probably you forgot to specify --chimOutType. See this page for some explanation or have a look at the demo script for an example how to run STAR.
There should be a blank between fusions.discarded.tsv and -d.
The structural variant file passed via -d should not be in VCF format. It is expected to have a simple 4-column TSV format as described in the manual. Out of curiosity, what tool did you use to call the structural variants? I might be able to help you with format conversion.
The default value of -s is auto, so you don't necessarily need to specify it.
The value of 0.1 for the parameter -V might be to strict. It effectively means that reads of 100nt length are discarded, if they have more that two mismatches. This might be intentional in your case, but the default value permits a few more mismatches, which has proven to work better in practice in my experience.

Now to the actual error:

Your suspicion is probably spot-on that the parameter -i needs to be adapted. This parameter specifies which contigs are reported by Arriba in the main output file. Fusions involving other contigs are discarded. For every contig mentioned in the argument of -i, Arriba checks if it can find the sequence of this contig in the assembly (genome.fna). The default value are all human chromosomes. So if your organism has fewer chromosomes, the given error will be thrown. Here is what you need to do: Find the names of all contigs of your organism. Let's say 1, 2, and 3 for example. Then pass this list as an argument to -i, for example: -i "1 2 3".

Feel free to ask for more details if any of my explanations are not clear. BTW all arguments are documented in the manual, in case you have not found it yet.

lucascaue commented 4 years ago

Hi! Yes I am using a hybrid Human genome with provirus. The VCF file was made by bcftools mpileup command. I followed your instructions, I changed the --ChimOutTypeto --chimOutType WithinBAM and suppressed the -V and -d commands. Now, I can run arriba with no errors, but 99.9% of the mismatches were filtered in this step:

Filtering duplicates (remaining=102394) Filtering mates which do not map to interesting contigs (remaining=1) Estimating mate gap distributionWARNING: not enough chimeric reads to estimate mate gap distribution, using default values

I am pretty sure that are at least 50 chimeric alignments. Do you have any idea what should I do now? Maybe one of the filters of -f should be suppressed?

Thank you so much for your help!

suhrig commented 4 years ago

I understand you are looking for sites of viral integration into the human genome? In that case, you must specify both the human and the viral chromosomes as an argument to -i. In addition, I recommend disabling the intronic filter, otherwise you will miss some events where the virus is integrated into intergenic / intronic loci.

When you used --chimOutType WithinBAM you don't need -c, because the chimeric alignments are stored in the main alignments file passed via -x. I guess you figured this out yourself; I just want to mention this in case you haven't.

suhrig commented 4 years ago

Accidently hit the close issue button. Reopening...

suhrig commented 4 years ago

The output of bcftools mpileup is of no use to Arriba. Mpileup calls SNVs and indels, not structural variants, so it makes sense that you removed this argument.

How many reads map to the human and to the viral chromosomes? Have you checked using samtools idxstats?

lucascaue commented 4 years ago

Hi! Yes I did that modifications you suggest! I realized that I have to declare the chromosomes with -i CONTIGS but just COMMA separeted list , when I declared something like -i 1 2 3 4 , arriba just recognize the first string and when I typed -i 1,2,3,4 it did recognize all 4 of them. Before this process I had to run STAR with the tag --chimOutType Junctions for some reason I can not use --chimOutType WithinBAM Junction. Then as I got the chimeric alignments file I can identify the best chimeric alignments and where are the breakpoints and annotate them to declare on -i CONTIGS. I Think right now I just have to play a little bit with the filter parameters.

I don't know how to understand the samtools idxstats

Thanks so much for your help suhrig!

suhrig commented 4 years ago

when I declared something like -i 1 2 3 4 , arriba just recognize the first string

Other users have run into the same issue. It is actually not an error of Arriba, but invalid bash syntax. You would run into the same issue, if you tried to dump the content of a file using cat if the file name contains a space, e.g. cat foo bar will issue an error:

cat: foo: No such file or directory
cat: bar: No such file or directory

The proper BASH syntax is to put it in double quotes: cat "foo bar"

To prevent more users from bumping into this obstacle future versions of Arriba will issue a clear error message.

I had to run STAR with the tag --chimOutType Junctions for some reason I can not use --chimOutType WithinBAM Junction

Note that WithinBAM is mandatory, if you want to use Arriba for fusion detection. Specifying Junctions alone will not work. If your version of STAR does not permit specifying Junctions and WithinBAM at the same time, you need to either drop Junctions or add --chimMultimapNmax 0. See this issue from the developer of STAR.

BTW, my version of STAR (2.7.3a) permits the use of WithinBAM and Junctions at the same time. Maybe you made a typo. You write Junction once and Junctions another time.

lucascaue commented 4 years ago

Exactly! I saw that post before. Now I could do the Arriba Analysis, and then following the script I ran draw_fusions.R but I got this error:

Loading annotation Loading protein domains Drawing fusion #1: UBE2E2:tat(4610) Drawing fusion #2: LOC105371031:tat Error in if (bands[band, "giemsa"] != "acen") { : missing value where TRUE/FALSE needed Calls: drawIdeogram Execution halted

Do you know What is this about? I did not find that on draws._fusion.Rcode

suhrig commented 4 years ago

You need to add a line to the cytobands file (tab-separated):

tat 1   SIZE_OF_tat tat gneg

where SIZE_OF_tat is the size of the tat chromosome.

The script crashes, because it looks for information about Giemsa staining of the tat chromosome, but can't find any. I should probably make the script robust enough that it does not crash, when this information is missing. Good point, thanks for making me aware of this!

lucascaue commented 4 years ago

Hi, Thanks for your comments. Definitely I had the identification for my fusions, and now I am just having this problem when I am run draw_fusions.R

draw_fusions.R --annotation=genannotation_mod.gtf --fusions=S9fusions_mod.tsv --output=S9output.pdf --alignments=S9Aligned.sortedByCoord.out.bam --cytobands=cytobands_mod.tsv --proteinDomains=protein_domains_sorted.gff3 --minConfidenceForCircosPlot=low

Loading annotation Loading protein domains Drawing fusion #1: LPP:virtual-gene Error in if (confidenceRank[f$confidence] >= confidenceRank[minConfidenceForCircosPlot] || : missing value where TRUE/FALSE needed Calls: drawCircos Execution halted

Do you know what is happening here? I think this is the last problem, I could not find this error identification on Arriba code. Even when I omitted minConfidenceForCircosPlot=low I got this error

suhrig commented 4 years ago

Did you modify the fusions file? My guess is the confidence column was shifted/accidentally edited and now contains an invalid value. It must contain either high, medium, or low. Any other value will trigger this error. Can you double-check that the fusions file is in correct format?

lucascaue commented 4 years ago

Hi. I checked that, it was made by a blank row in the final of the file. Now it is working perfectly! Thank you so much for your help!

suhrig commented 4 years ago

A new version of Arriba has just been released. It includes fixes for the issues mentioned in this thread. Specifically, a clear error message is thrown, when a user forgets to wrap arguments with blanks in quotes. Moreover, a warning is issued when the junctions file is passed to Arriba. And the draw_fusions.R script no longer chokes when cytoband information is missing. I'm closing this. Feel free to reopen, if you still have issues.

suhrig / arriba

Arriba error while running STAR outputs #33