Closed Spartanzhao closed 1 year ago
I can send you the commands to build your own STAR index, but I need to know the following details:
Thank you so much. I think this is everything that is needed: I only need to run hg19, and I don't need the virus index. The files I have are hs37d5.fa.gz, chromFa.tar.gz, Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz, gencode.v19.annotation.gtf.gz, refGene.txt.gz, and Homo_sapiens.GRCh37.87.chr.gtf.gz.
Appreciate your help!
declare -A ASSEMBLIES
ASSEMBLIES[hs37d5]="/gdata01/user/zhaoyue01/test/ARRIBA/database/hs37d5.fa.gz"
ASSEMBLIES[hg19]="/gdata01/user/zhaoyue01/test/ARRIBA/database/chromFa.tar.gz"
ASSEMBLIES[GRCh37]="/gdata01/user/zhaoyue01/test/ARRIBA/database/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz"
declare -A ANNOTATIONS
ANNOTATIONS[GENCODE19]="/gdata01/user/zhaoyue01/test/ARRIBA/database/gencode.v19.annotation.gtf.gz"
ANNOTATIONS[RefSeq_hg19]="/gdata01/user/zhaoyue01/test/ARRIBA/database/refGene.txt.gz"
ANNOTATIONS[ENSEMBL87]="/gdata01/user/zhaoyue01/test/ARRIBA/database/Homo_sapiens.GRCh37.87.chr.gtf.gz"
It seems like you have already done most of the work! Since you have already adapted the variables ASSEMBLIES and ANNOTATIONS, all that is left to do is to remove the use of wget. This is accomplished simply by setting the variable WGET=cat further down in the script on line 74. Let me know if you have trouble getting this to work.
BTW, you don't need to download all assemblies/annotations for a given genome version. It suffices to download just one (e.g., hs37d5+GENCODE19).
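The WGET=cat substitution works because cat writes a local file to standard output in the same way a download command would stream a remote one. Below is a minimal sketch of that idea. It assumes the script reads each source through a variable named WGET and pipes the result into a decompressor; the demo file, the "demo" key, and the pipeline are illustrative and not copied from download_references.sh.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for a reference file that was uploaded manually.
tmpdir=$(mktemp -d)
printf '>chr_demo\nACGT\n' | gzip -c > "$tmpdir/demo.fa.gz"

# The table holds a local path where a URL would normally be.
declare -A ASSEMBLIES
ASSEMBLIES[demo]="$tmpdir/demo.fa.gz"

# With WGET=cat, the "download" step simply streams the local file.
WGET="cat"
fetched=$($WGET "${ASSEMBLIES[demo]}" | gunzip -c)
echo "$fetched"

rm -rf "$tmpdir"
```

Because the rest of the pipeline only sees a byte stream on standard input, it should not matter whether the bytes came from the network or from disk.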
Thanks, I'll give it a try.
After I ran sh download_references.sh, it exited immediately. I think it executed this code:
if [ $# -ne 1 ] || [ -z "$1" ] || [ -z "${COMBINATIONS[$1]}" ]; then
    echo "Usage: $(basename $0) ASSEMBLY+ANNOTATION" 1>&2
    echo "Available assemblies and annotations:" 1>&2
    tr ' ' '\n' <<<"${!COMBINATIONS[@]}" | sort 1>&2
    exit 1
fi
However, after I deleted this code, I got this:
download_references.sh:line 36: COMBINATIONS: bad tuple
download_references.sh:line 42: COMBINATIONS: bad tuple
download_references.sh:line 52: ASSEMBLIES: bad tuple?
I think something is wrong with this code:
ASSEMBLY="${COMBINATIONS[$1]%+*}"
ANNOTATION="${COMBINATIONS[$1]#*+}"
echo "Downloading assembly: ${ASSEMBLIES[$ASSEMBLY]}"
Please refer to the manual on how to use download_references.sh.
First, you should not call the script with sh. It is not an sh script; it is a bash script. The script uses some features that sh is not capable of. Either call the script without an interpreter as stated in the manual (./download_references.sh) or with bash as the interpreter (bash download_references.sh).
Second, you must pass a combination of assembly and annotation as an argument, e.g. hs37d5+GENCODE19.
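To illustrate why the interpreter matters, here is a minimal sketch of how an ASSEMBLY+ANNOTATION argument can be looked up and split in bash. It uses an associative array (declare -A), which is a bash feature that plain POSIX sh does not provide. The way COMBINATIONS is filled here is an assumption made for the example and may differ from the real script.

```bash
#!/usr/bin/env bash
# Associative arrays (declare -A) exist in bash but not in POSIX sh.
declare -A COMBINATIONS
COMBINATIONS["hs37d5+GENCODE19"]="hs37d5+GENCODE19"   # assumed layout

arg="hs37d5+GENCODE19"   # stands in for the script argument $1

# Split the combination at the plus sign with parameter expansion.
ASSEMBLY="${COMBINATIONS[$arg]%+*}"     # drop the shortest suffix matching "+*"
ANNOTATION="${COMBINATIONS[$arg]#*+}"   # drop the shortest prefix matching "*+"

echo "assembly=$ASSEMBLY annotation=$ANNOTATION"
```

Run under bash, this prints assembly=hs37d5 annotation=GENCODE19. Under a strict sh, the declare -A line is likely to fail first, which fits the kind of errors reported above.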
Thanks, I'll give it a try.
I appreciate your help! Arriba works very well and successfully detected a known gene fusion in one sample that other tools missed. We would now like to verify it in IGV. Here is the result:
PAX3 FOXO1 -/- -/- 2:223084859 13:41134997 CDS/splice-site CDS/splice-site translocation 121 1747 high in-frame Mitelman 'Paired_box'_domain(100%),Homeodomain(100%),Paired_box_protein7(100%)|Forkhead_domain(100%) . . ENSG00000135903.14 ENSG00000150907.6 ENST00000344493.4 ENST00000379561.5 upstream downstream duplicates(50) GCAGCTCTGCCTACTGCCTCCCCAGCACCAGGCATGGATTTTCCAGCTATACAGACAGCTTTGTGCCTCCGTCGGGGCCCTCCAACCCCATGAACCCCACCATTGGCAATGGCCTCTCACCTCAG|AATTCAATTCGTCATAATCTGTCCCTACACAGCAAGTTCATTCGTGTGCAGAATGAAGGAACTGGAAAAAGTTCTTGGTGGATGCTCAATCCAGAGGGTGGCAAGAGCGGGAAATCT SSAYCLPSTRHGFSSYTDSFVPPSGPSNPMNPTIGNGLSPQ|NSIRHNLSLHSKFIRVQNEGTGKSSWWMLNPEGGKSGK E100050154L1C002R0032905769,E100050154L1C002R0071774534,E100050154L1C010R0101840224,E100050154L1C011R0353450414,E100050154L1C012R0361510826,E100050154L1C016R0124214752,E100050154L1C023R0340397342,E100050154L1C024R0380865075,E100050154L1C027R0401969947,E100050154L1C029R0301873558,E100050154L1C029R0401675097,E100050154L1C035R0191290128,E100050154L1C037R0293192050,E100050154L1C040R0361662640,E100050154L1C042R0070367659,E100050154L1C001R0292132050,E100050154L1C002R0160323903,E100050154L1C002R0380461605,E100050154L1C005R0091459694,E100050154L1C007R0043626696,E100050154L1C007R0381242214,E100050154L1C008R0123712437,E100050154L1C010R0154342555,E100050154L1C010R0273069600,E100050154L1C011R0121934213,E100050154L1C012R0201184950,E100050154L1C013R0230947313,E100050154L1C013R0233078877,E100050154L1C018R0060557320,E100050154L1C019R0223103665,E100050154L1C020R0011315558,E100050154L1C020R0113081898,E100050154L1C020R0391351671,E100050154L1C023R0013396282,E100050154L1C023R0321024653,E100050154L1C023R0412162649,E100050154L1C024R0221173763,E100050154L1C025R0073138396,E100050154L1C025R0310937867,E100050154L1C026R0410258465,E100050154L1C028R0230277255,E100050154L1C028R0352447309,E100050154L1C031R0121392460,E100050154L1C032R0151998981,E100050154L1C032R0332434418,E100050154L1C033R0152996958,E100050154L1C034R0221387361,E100050154L1C035R0010656346,E100050154L1C035R0080010086,E100050154L1C035R0341964089,E100050154L1C036R0292264132,E100050154L1C038R0062791449,E100050154L1C039R0073818680,E100050154L1C040R0102715483,E100050154L1C040R0143199276,E100050154L1C040R0303189168,E100050154L1C041R0201682678
How can I extract the reads that support the fusion of the two genes? Are they in the last column? It seems to contain 57 read IDs. However, the two columns split_reads1 and split_reads2 show that only 3 and 4 reads, respectively, support the fusion. How can I get these reads?
How can I extract the reads that support the fusion of the two genes?
There is a script for exactly this purpose:
https://github.com/suhrig/arriba/blob/master/scripts/extract_fusion-supporting_alignments.sh
Run it without arguments to learn about its usage.
However, the two columns split_reads1 and split_reads2 show that only 3 and 4 reads support the fusion.
The other reads are duplicates and were discarded. Take a look at the column filters.
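For a quick look at the identifiers themselves, the comma-separated list in the last column can be turned into one identifier per line. The following is a minimal sketch, not a replacement for the script linked above. It assumes a tab-separated table with the two gene names in columns 1 and 2 and the read identifiers in the last column, as in the row posted earlier; the function name fusion_read_ids and the demo table are hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print the read identifiers (last column, comma-separated) of one fusion,
# one identifier per line. Assumes gene names are in columns 1 and 2.
fusion_read_ids() {
    local tsv=$1 gene1=$2 gene2=$3
    awk -F'\t' -v g1="$gene1" -v g2="$gene2" '
        $1 == g1 && $2 == g2 {
            n = split($NF, ids, ",")
            for (i = 1; i <= n; i++) print ids[i]
        }' "$tsv"
}

# Tiny made-up table standing in for the real output file.
demo=$(mktemp)
printf 'PAX3\tFOXO1\thigh\tread_a,read_b,read_c\n' > "$demo"
printf 'GENE_X\tGENE_Y\tlow\tread_d\n' >> "$demo"

ids=$(fusion_read_ids "$demo" PAX3 FOXO1)
echo "$ids"
rm -f "$demo"
```

A list like this could then be saved to a file and used to filter alignments by read name, for example with samtools view -N in recent samtools versions. Note that in the example from this thread the list also contains the reads that were discarded as duplicates.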
Is this panel-seq data? Do you use UMIs? If so, there may be an opportunity to improve Arriba's parameters.
Yes, this is panel-seq data. Which parameters can I use to improve my results?
This explains the high duplication rate. Does your library have unique molecular identifiers (UMIs)? Arriba performs duplicate marking solely based on the mapping coordinates. If you have UMIs, then duplicate marking can be improved by making use of the UMI information. If that's the case, you will need to perform duplicate marking with an external UMI-aware tool, and then pass the duplicate-marked alignments to arriba with the parameter -u.
If you don't have UMIs, it might be advisable to disable duplicate removal altogether. This depends on how your panel works: disabling it makes sense only when your panel is constructed in a way that all reads have the same start and end coordinates. In all other cases, it is recommended to leave duplicate filtering enabled.
Please take a look at the section about panel-seq data in the manual if you haven't already.
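If an external tool does the duplicate marking, it can be worth checking that duplicates were actually flagged before handing the file to arriba with -u. Here is a minimal sketch that counts flagged records in SAM text. It assumes the external tool marks duplicates with the standard SAM FLAG bit 0x400 (decimal 1024) and that this is what -u relies on; the function name count_marked_duplicates and the demo records are made up for illustration.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Count SAM records whose FLAG (column 2) has the duplicate bit 0x400 set.
# Header lines start with "@" and are skipped. Plain arithmetic is used
# instead of gawk's and() so that any POSIX awk can run it.
# Prints: <marked duplicates> <total records>
count_marked_duplicates() {
    awk -F'\t' '
        !/^@/ { total++; if (int($2 / 1024) % 2 == 1) dup++ }
        END   { printf "%d %d\n", dup, total }'
}

# Made-up records: flags 99 and 147 are unmarked, 1123 and 1171 include 1024.
sam=$'@HD\tVN:1.6\nr1\t99\tchr2\nr1\t147\tchr2\nr2\t1123\tchr2\nr2\t1171\tchr2'
result=$(printf '%s\n' "$sam" | count_marked_duplicates)
echo "$result"
```

On a real file, the function could be fed with samtools view -h alignments.bam, where alignments.bam is a placeholder name. A count of zero marked duplicates would suggest that the marking step did not take effect.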
Okay, Thank you so much!
Do you have any further questions or can I close this issue?
No, thank you so much, and thanks for your patience. You did very good work, and this software helped me a lot. It can detect fusion genes that other tools cannot.
Glad to hear! I'm closing this issue, then. If you run into any other problems, feel free to reach out to me again.
My server cannot use wget, so I downloaded the data from the internet and uploaded it to my server. Could you provide me with a download_references.sh that works without wget? I want to build the STAR index with my downloaded files.