timbitz / Aligater

Software suite for detection/analysis of chimeric RNAs from LIGR-seq data
MIT License

aligater stats error #7

Open fgypas opened 7 years ago

fgypas commented 7 years ago

Hi again.

I tried to run the following step:

foregroundfile="sample_rep1-%.final.lig"
backgroundfile="sample_rep1-%.expression.lig"

nameParam="--fore $foregroundfile --back $backgroundfile --nd xlink,unx"

totalAMTreads=142355952
totalMOCKreads=160164420

normParam="--nc $totalAMTreads,$totalMOCKreads"

aligater stats $nameParam $normParam > output.pvl

But after a point I get the following error message:

[aligater stats]: Loading Packages..
Loading Background sample_rep1-xlink.expression.lig..
Processing probability distribution..
Loading Foreground sample_rep1-xlink.final.lig..
ERROR: LoadError: BoundsError: attempt to access 19-element Array{SubString{ASCIIString},1}:
 "S"
 "A:A"
 "1:2"
 "DPP9:DPP9"
 "ENSG00000142002.12:ENSG00000142002.12"
 "ENST00000598041.1:ENST00000598041.1"
 "protein-coding:protein-coding"
 "tRNA-Val-GTG_tRNA:tRNA-Val-GTG_tRNA"
 "tRNA:tRNA"
 "D00535:178:H3VVCBCXY:1:1104:4087:1975"
 "gggggaaacaccACGCGAAAGGTCCCCGGTTCGAAACCGGGCGGAAACAC_CACGCGAAAGGTCCCCGGTTCGAAACCGGGCGGAAACAccacgcgaaaggt"
 "96"
 "1,1>1[1]>36"
 "38,2"
 "72,36"
 "38,38"
 "chr19:4724647:-,chr19:4724683:-"
 "tRNA:tRNA"
 "tRNA-Val-GTG_tRNA:tRNA-Val-GTG_tRNA"
 at index [25]
 in loadInteractionFile at Aligater/bin/stats.jl:160
 in main at Aligater/bin/stats.jl:414
 in include at ./boot.jl:261
 in include_from_node1 at ./loading.jl:333
 in process_options at ./client.jl:280
 in _start at ./client.jl:378
while loading Aligater/bin/stats.jl, in expression starting on line 437

It seems that it tries to access column 25, although the files have only 19 columns. Any ideas? I am not familiar with Julia.

timbitz commented 7 years ago

Hi @fgypas, it looks like you did not run the RactIP step (which is OK!). That step adds several columns that are expected by default, hence the column 25 lookup. You can get around this by using the --gi option to specify the column containing the final gene_id:gene_id fields (which should be the last column, here 19). Also feel free to run julia bin/stats.jl -h to see the command-line options. I apologize for not having this info in the README. I added those options so that someone could omit the post stages, but I have not done that myself yet.
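
To pick a value for --gi, it can help to count the columns in a .lig line before running stats. The following is only a minimal sketch: it assumes the .lig file is tab-delimited and that --gi takes a 1-based column index (as "here 19" suggests), neither of which is confirmed in this thread. The helper `last_column_index` is hypothetical and not part of Aligater.

```python
def last_column_index(line, sep="\t"):
    """Return the 1-based index of the last column in one .lig line.

    Assumes tab-delimited fields; pass a different `sep` if the file differs.
    """
    fields = line.rstrip("\n").split(sep)
    return len(fields)


# Abbreviated, made-up 19-field line, for illustration only.
example = "\t".join(["S", "A:A", "1:2"] + ["x"] * 15 + ["DPP9:DPP9"])
print(last_column_index(example))
```

On a real file, the same function could be applied to the first line read from, for example, sample_rep1-xlink.final.lig, and the result passed to --gi.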

fgypas commented 7 years ago

Hi @timbitz, and thank you for the quick response. Indeed, this was the issue (I did not run RactIP), and specifying the option --gi 19 got past this problem, but now I have run into another one. Can you take a look at the following log output?

Loading Background sample_rep1-xlink.expression.lig..
Processing probability distribution..
Loading Foreground sample_rep1-xlink.final.lig..
Calculating binomial stats..
Loading Background sample_rep1-unx.expression.lig..
Processing probability distribution..
Loading Foreground sample_rep1-unx.final.lig..
Calculating binomial stats..
ERROR: LoadError: MethodError: keys has no method matching keys(::Void)
 in printVardict at Aligater/bin/stats.jl:295
 in printStats at Aligater/bin/stats.jl:350
 in main at Aligater/bin/stats.jl:432
 in include at ./boot.jl:261
 in include_from_node1 at ./loading.jl:333
 in process_options at ./client.jl:280
 in _start at ./client.jl:378
while loading Aligater/bin/stats.jl, in expression starting on line 437

Any idea here?

timbitz commented 7 years ago

Hmm. My guess is that it is trying to calculate column stats for columns that are of an incompatible type or do not exist, hence the call to keys with a Void type. Do you have a --vs string? What does it look like? The default I supplied, --vs 18:f,24:p,21:f,22:d,23:d, tries to access columns 21 through 24. This will need to be modified for the 19-column lig file; for example, --vs 18:p should be analogous to --vs 24:p.
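
A quick way to see which entries of a --vs string point past the end of a shorter lig file is sketched below. It assumes the comma-separated column:type layout of the --vs string quoted in this comment and 1-based column numbers; `out_of_range_vs_columns` is a hypothetical helper, not an Aligater function.

```python
def out_of_range_vs_columns(vs_string, n_columns):
    """List the column indices in a --vs string that exceed n_columns.

    Assumes entries of the form 'column:type' separated by commas.
    """
    bad = []
    for entry in vs_string.split(","):
        col, _, _type = entry.partition(":")
        if int(col) > n_columns:
            bad.append(int(col))
    return bad


vs_from_thread = "18:f,24:p,21:f,22:d,23:d"
print(out_of_range_vs_columns(vs_from_thread, 19))  # columns absent from a 19-column file
print(out_of_range_vs_columns("18:p", 19))
```

Any index this reports would need to be dropped or remapped before passing the string to --vs.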

fgypas commented 7 years ago

No, I did not specify any --vs string... Just the default, which according to the options is empty ""

fgypas commented 7 years ago

An example line looks like the following, listed one column per row:

1) S
2) A:A
3) 7:10
4) SH3D19:SH3D19
5) ENSG00000109686.12:ENSG00000109686.12
6) ENST00000455740.1:ENST00000304527.4
7) protein-coding:protein-coding
8) NA:NA
9) NA:NA
10) D00535:178:H3VVCBCXY:2:1108:10191:6676
11) ggcCACCCAGGTCAAACAGGAGGTTTTGTGCGAGTACCCCCAAGGTTGCCACCGA_GCTGTCTGTGCCTCATGGAATTGCCAATGAAGATATTGTCTCTCAA
12) 176
13) 1,1>16[2]>68
14) 2066,2332
15) NA,NA
16) 54,46
17) chr4:152069339:-,chr4:152065440:-
18) protein-coding:protein-coding
19) SH3D19:SH3D1

fgypas commented 7 years ago

In the end, if I run the following command, I get some output:

aligater stats $nameParam $normParam --gi 5 --vs 18:p > output.pvl

But it is not clear to me what each column means, since there is no header. Here is a line of the file:

ENSG00000201466.1,ENSG00000202354.1 0.5 0.7 1.4 3 1.1 5 2.3 1.00e+00 1.00e+00 0.0 0.0 18 misc-RNA,misc-RNA_scRNA

timbitz commented 7 years ago

Hi @fgypas, I apologize for not having a header! I will correct that and add some more documentation as to how to interpret and filter this output. On another note, you should probably be using --gi 19 as this is the better of the two gene_id columns created by reclass. For now, the headers in the .pvl file can be found in Sup Table 1 of the LIGR-Seq/Aligater publication:

Extended Data Table 1
Gene-ids : Comma delimited HUGO or Repeat family name in lexicographical order
OE[+amt/-amt] : (+AMT/-AMT) / (Expected(+AMT)/Expected(-AMT))
AMT/Mock : (+AMT/-AMT)
Exp[AMT/Mock] : Expected(+AMT)/Expected(-AMT)
AMTReads : Number of reads in the +AMT sample + 1 pseudo count
OE-amt : Observed / Expected for +AMT sample (see Methods)
MockReads : Number of reads in the -AMT sample + 1 pseudo count
OE-mock : Observed / Expected for -AMT sample (see Methods)
AMT+ pval : Binomial p-value for significance in the +AMT sample
Mock pval : Binomial p-value for significance in the -AMT sample
AMT RPM : The transcript's expression in Reads per Million from -ligase sample
Mock RPM : The transcript's expression in Reads per Million from -ligase sample
Transcript Biotypes : Gene biotypes from GENCODE annotations
... any other summary line from --vs string
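
As a rough illustration of how these names line up with a .pvl line, the sketch below pairs the first twelve headers with the example line posted earlier in this thread. It is an assumption that the fields are whitespace-separated and appear in the order listed; the example line was produced with --gi 5 and --vs 18:p, so how its trailing fields map to "Transcript Biotypes" and the --vs summaries is not confirmed here, and they are left unlabelled. `PVL_HEADERS` and `label_pvl_line` are hypothetical names.

```python
PVL_HEADERS = [
    "Gene-ids", "OE[+amt/-amt]", "AMT/Mock", "Exp[AMT/Mock]",
    "AMTReads", "OE-amt", "MockReads", "OE-mock",
    "AMT+ pval", "Mock pval", "AMT RPM", "Mock RPM",
]


def label_pvl_line(line):
    """Pair the first 12 fields of a .pvl line with the header names above.

    Fields beyond the 12th are returned unlabelled under "extra".
    Splitting on whitespace is an assumption about the file layout.
    """
    fields = line.split()
    row = dict(zip(PVL_HEADERS, fields))
    row["extra"] = fields[len(PVL_HEADERS):]
    return row


line = ("ENSG00000201466.1,ENSG00000202354.1 0.5 0.7 1.4 3 1.1 5 2.3 "
        "1.00e+00 1.00e+00 0.0 0.0 18 misc-RNA,misc-RNA_scRNA")
row = label_pvl_line(line)

# OE[+amt/-amt] should equal (AMT/Mock) / Exp[AMT/Mock], up to rounding.
ratio = float(row["AMT/Mock"]) / float(row["Exp[AMT/Mock]"])
print(row["OE[+amt/-amt]"], round(ratio, 1))
```

The ratio check at the end is a simple consistency test of the assumed column order against the definition of OE[+amt/-amt] given in the table.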

fgypas commented 7 years ago

Thanks @timbitz for the constant support. I really appreciate it; it is really helpful. By the way, I think there is another typo. According to line 200 of "Aligater/bin/stats.jl", the ASCIIString type is specified by s, not by c as mentioned in the documentation.

timbitz commented 7 years ago

Yah that is definitely a typo in the docs.. thanks for pointing that out!