vastgroup / vast-tools

A toolset for profiling alternative splicing events in RNA-Seq data.
MIT License

vast-tools merge error in last stable version (v2.4.0) #87

Closed PedroBarbosa closed 4 years ago

PedroBarbosa commented 4 years ago

Hi,

I've been using vast-tools quite successfully in the past in Docker containers, in particular an image with v2.1 installed: biocorecrg/vast-tools:2.1.3.

Recently I updated the image for the latest version using the official image from the tool repository (vastgroup/vast-tools:v2.4.0). I also updated the vastDB local files to the latest hg38 version.

The alignment stage went OK (although I needed to set --IR_version 1 because v2 was taking too long), and the whole pipeline (combine, tidy, compare, diff) runs smoothly when I don't merge my samples. However, when I try to merge the data based on my custom groups, I get the following error using the latest image:

Image v2.4.1 throws the same error, while v2.3.0 expected a different naming of the vastDB (so I didn't dare to test it).

Any help? Regards, Pedro

readline() on closed filehandle VERSION at /usr/local/bin/vast-tools line 65.
Use of uninitialized value $version in scalar chomp at /usr/local/bin/vast-tools line 66.
readline() on closed filehandle VERSION at /usr/local/vast-tools/bin/MergeOutputs.pl line 82.
Use of uninitialized value $version in scalar chomp at /usr/local/vast-tools/bin/MergeOutputs.pl line 83.
[vast merge]: VAST-TOOLS vNo version found
[vast merge]: Setting output directory to /mnt/nfs/lobo/MCFONSECA-NFS/mcfonseca/shared/chris/splicing/vastools/second_analysis_v2.4
[vast merge]: S_3y_CTRL: found to_combine/S_3y_CTRL.info. Sample will be treated as being strand-specific.
[vast merge]: S_10y_CTRL: found to_combine/S_10y_CTRL.info. Sample will be treated as being strand-specific.
[vast merge]: S_11y_CTRL: found to_combine/S_11y_CTRL.info. Sample will be treated as being strand-specific.
[vast merge]: S_3y_CBE: found to_combine/S_3y_CBE.info. Sample will be treated as being strand-specific.
[vast merge]: S_10y_CBE: found to_combine/S_10y_CBE.info. Sample will be treated as being strand-specific.
[vast merge]: S_11y_CBE: found to_combine/S_11y_CBE.info. Sample will be treated as being strand-specific.
[vast merge]: Loading IR files (version 1)
[vast merge]:   Processing to_combine/S_10y_CBE.IR
[vast merge]:   Processing to_combine/S_10y_CTRL.IR
[vast merge]:   Processing to_combine/S_11y_CBE.IR
[vast merge]:   Processing to_combine/S_11y_CTRL.IR
[vast merge]:   Processing to_combine/S_3y_CBE.IR
[vast merge]:   Processing to_combine/S_3y_CTRL.IR
[vast merge]: Loading Microexon files
[vast merge]:   Processing to_combine/S_10y_CBE.micX
[vast merge]:   Processing to_combine/S_10y_CTRL.micX
[vast merge]:   Processing to_combine/S_11y_CBE.micX
[vast merge]:   Processing to_combine/S_11y_CTRL.micX
[vast merge]:   Processing to_combine/S_3y_CBE.micX
[vast merge]:   Processing to_combine/S_3y_CTRL.micX
[vast merge]: Loading eej2 files
[vast merge]:   Processing to_combine/S_10y_CBE.eej2
[vast merge]:   Processing to_combine/S_10y_CTRL.eej2
[vast merge]:   Processing to_combine/S_11y_CBE.eej2
[vast merge]:   Processing to_combine/S_11y_CTRL.eej2
[vast merge]:   Processing to_combine/S_3y_CBE.eej2
[vast merge]:   Processing to_combine/S_3y_CTRL.eej2
Use of uninitialized value $p in array element at /usr/local/vast-tools/bin/MergeOutputs.pl line 467, <I> line 58.
Use of uninitialized value $n in addition (+) at /usr/local/vast-tools/bin/MergeOutputs.pl line 467, <I> line 58.
Use of uninitialized value $n in addition (+) at /usr/local/vast-tools/bin/MergeOutputs.pl line 468, <I> line 58.
[vast merge error]: Sum of positions ne total provided for ENSG00000166913      24-16 in to_combine/S_3y_CTRL.eej2

Use of uninitialized value in concatenation (.) or string at /usr/local/vast-tools/bin/MergeOutputs.pl line 460, <I> line 59.
Use of uninitialized value in addition (+) at /usr/local/vast-tools/bin/MergeOutputs.pl line 461, <I> line 59.
Use of uninitialized value in split at /usr/local/vast-tools/bin/MergeOutputs.pl line 464, <I> line 59.
Use of uninitialized value in string ne at /usr/local/vast-tools/bin/MergeOutputs.pl line 470, <I> line 59.
[vast merge error]: Sum of positions ne total provided for :1,2:23,3:56,4:33,5:72,6:59,7:41,8:33,9:35,10:34,11:37,12:19,13:47,14:36,15:113,16:46,17:64,18:70,19:28,20:20,21:7,2
mirimia commented 4 years ago

Hi Pedro,

This is strange. It may be a problem with the align step for S_3y_CTRL. Could you retry the merge without that sample?

Also, I see it cannot find the VERSION file. Does that happen for other modules?

Thanks

PedroBarbosa commented 4 years ago

This is strange. It may be a problem with the align step for S_3y_CTRL. Could you retry the merge without that sample?

Yes, it worked now. I will realign that sample. Thank you.

Also, I see it cannot find the VERSION file. Does that happen for other modules?

Yes, it happens in all the modules. Also, a minor request: it would be useful to have an option to set the group name in compare/tidy, rather than taking the name of the first replicate. I think the diff module has that option. Otherwise I get things like this printed after compare:

Use of uninitialized value $version in scalar chomp at /usr/local/vast-tools/bin/ComparePSI.pl line 120.
[vast compare]: VAST-TOOLS vNo version found
[vast compare]: Species assembly: hg38, VASTDB Species key: Hs2
[vast compare]: Doing comparisons of AS profiles (S_3y vs S_3y)

Best, Pedro

mirimia commented 4 years ago

Good.

OK, I'll add that option and push it ASAP.

Thanks

PedroBarbosa commented 4 years ago

Hi again,

I'm reopening this issue because the problem hasn't been solved. I'm having trouble at the merge stage with this same sample.

I think the alignment went fine, but please take a look at the log. Do you see anything peculiar?

1050358_vast_align.log

Best, Pedro

mirimia commented 4 years ago

Hmm, the log looks ok. Can you send me all the outputs from align by email? I’ve never seen that error...

PedroBarbosa commented 4 years ago

Ok, I realised what the problem was. There was a minor typo in the sample name. Bah!! Everything is fine, thank you.
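A sample-name typo like this one can be caught before running merge. The following is a minimal Python sketch, not part of vast-tools. It assumes the groups file passed to `merge -g` holds one tab-separated `sample<TAB>group` pair per line, and that every sample should have a matching `.eej2` file in `to_combine` (as seen in the merge log). The function name `samples_without_files` is hypothetical.

```python
from pathlib import Path


def samples_without_files(groups_path, to_combine_dir, suffix=".eej2"):
    """Return sample names listed in the groups file that have no matching file.

    Assumption: one tab-separated "sample<TAB>group" pair per line.
    """
    to_combine = Path(to_combine_dir)
    missing = []
    with open(groups_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            sample = line.split("\t")[0]
            # A misspelled sample name has no file of that name in to_combine.
            if not (to_combine / f"{sample}{suffix}").is_file():
                missing.append(sample)
    return missing
```

Any name this returns is worth checking for spelling before looking for a problem in align.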

mirimia commented 4 years ago

Good.

I have already added the requested option to provide sample group names to compare and compare_expr. Just pull the latest version. (I didn't know what you meant for tidy?) Thanks!

PedroBarbosa commented 4 years ago

Oh right, it's for compare only. I was seeing it twice in the logs because of two compares (with --print_dPSI and without) and didn't pay careful attention.

Thanks very much for the help. Pedro

PedroBarbosa commented 4 years ago

Dear @mirimia ,

I'm afraid I need to reopen this issue, as I'm having the same problem with another dataset. The exact same pipeline works fine for other datasets, but not for this one. I double-checked file names and alignment logs, and everything seems OK. Could you look at this yourself and try to merge the files I'm sending below (1.4 GB)?

https://www.dropbox.com/sh/22meq64u8f61s69/AAAEM2RrWPoyLkVfC347oGjZa?dl=0

Everything is there: the to_combine folder with all the necessary data and a file with the groups and samples to merge. I also provide the log with the errors I get and another one with the logs of the align step. I'm using the latest stable Docker image (2.4.0).

Could you let me know when you have downloaded the data, so I can remove it from Dropbox? Thank you very much.

Best, Pedro

mirimia commented 4 years ago

Hi Pedro,

This is very strange. I have never seen this issue before, and I have merged and processed a lot of samples with vast-tools. Perhaps fewer with Hs2, so I wonder whether the issue is with this species somehow, but I don't see why this should happen, especially since those failed junctions are sometimes fine. What happens is that the string with the number of reads per position is not properly printed in the eej2 file. This string is only descriptive and is not used at all, but it makes merge crash. It looks as if it's unfinished, so perhaps it's a memory issue?

Have you tried to run align again on the files that give the error? It'd be important to know whether the same error occurs again. If not, I'd assume it's a strange memory issue, but then I'd expect the job to crash in the cluster (and the align log looks OK).

Please run it again for the failed samples (perhaps increasing the allocated memory, although 6-8GB should be enough), and send me the new *eej2 files.

Thanks Manu
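The consistency check that merge trips over can be reproduced outside vast-tools to find the affected lines in a given eej2 file. The following is a rough Python sketch under an assumed layout inferred from the error messages in this thread: four tab-separated fields (gene, junction, total reads, and a `position:count,position:count,...` string). It is not the code in MergeOutputs.pl, and the function names are hypothetical.

```python
def check_eej2_line(line):
    """Return None if the line looks consistent, else a short reason string.

    Assumed layout: gene <TAB> junction <TAB> total <TAB> pos:count,pos:count,...
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        return "too few fields"
    total, positions = fields[2], fields[3]
    try:
        expected = int(total)
        # Sum the count part of every "position:count" item.
        observed = sum(int(item.split(":")[1])
                       for item in positions.split(",") if item)
    except (ValueError, IndexError):
        return "malformed count"
    if observed != expected:
        return f"sum {observed} != total {expected}"
    return None


def find_bad_lines(handle):
    """Return (line_number, reason) pairs for every inconsistent line."""
    bad = []
    for number, line in enumerate(handle, start=1):
        reason = check_eej2_line(line)
        if reason:
            bad.append((number, reason))
    return bad
```

Running `find_bad_lines(open(path))` on the eej2 files from two align runs shows whether the same lines are cut off each time or different ones.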

PedroBarbosa commented 4 years ago

I did try to align again, but didn't change the memory settings in the cluster sbatch script. However, by default it's already 75 GB per task.

I'll align again and will let you know. Best, Pedro

PedroBarbosa commented 4 years ago

Ah, one thing that just came to mind: this dataset contains data from different sequencing runs for the same sample. Therefore, I had some samples with different read lengths depending on the sequencing instrument used. I don't know if that could somehow affect the align stage.

mirimia commented 4 years ago

75GB should be more than enough (so I'm even more lost). Also, the different read lengths should not matter. So, let's try:

PedroBarbosa commented 4 years ago

Align is still running, but for the samples that have finished I tried to merge, and the problem remains.

1162136_vast_splicing.log

Could it be that somehow accessing the vastDB files simultaneously in the align stage is the reason? I run all samples in parallel using job arrays across different computing nodes.

mirimia commented 4 years ago

That could be a reason, but I somehow doubt it.

I pushed a commit with a small change on Analyze_COMBI.pl, which is the script that prints those lines. It should give the same output (when it works well), but hopefully more efficiently.

Please pull the latest version (or just substitute the script) and re-run all the samples. Hopefully this will fix the issue!

Thanks!

mirimia commented 4 years ago

OK, what is even more bizarre is that the lines that give errors now are different from the other run. Clearly, there is something randomly interrupting the normal process of read count.

What you meant is that you run all the samples in parallel in a cluster, right? That's certainly not a problem (I do it all the time for dozens of samples). Perhaps the issue comes from using multiple cores. I saw you used 10; I normally use 4 and have run with up to 16 many times and never saw this problem. But right now this is the only other thing I can think of.

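So, we can try the following:

  • you run it with just one core (for the test, you can use --onlyEX, which will be much faster)

  • if you want, you can send me one of the problematic FQ files and I try it here.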

Thanks again

mirimia commented 4 years ago

Checking the main script of align, I don’t think the issue is the number of cores either. Btw, if possible, please send me the eej2 of the different runs when they are done, so I can compare them properly.

PedroBarbosa commented 4 years ago

I pushed a commit with a small change on Analyze_COMBI.pl, which is the script that prints those lines. It should give the same output (when it works well), but hopefully more efficiently.

No difference with the new version. Still get the out of memory error.

/home/pedro.barbosa/git_repos/vast-tools/vast-tools merge -g /mnt/nfs/lobo/MCFONSECA-NFS/mcfonseca/shared/yvan_HCM_rnaseq/splicing_analysis/vastools/test.txt --sp hg38 --dbDir /home/mcfonseca/shared/genomes/human/hg38/vast-tools/ -o /mnt/nfs/lobo/MCFONSECA-NFS/mcfonseca/shared/yvan_HCM_rnaseq/splicing_analysis/vastools --IR_version 1
[vast merge]: VAST-TOOLS v2.4.1
[vast merge]: Setting output directory to /mnt/nfs/lobo/MCFONSECA-NFS/mcfonseca/shared/yvan_HCM_rnaseq/splicing_analysis/vastools
[vast merge]: F1505922_Ctrl_R1: found to_combine/F1505922_Ctrl_R1.info. Sample will be treated as being strand-specific.
[vast merge]: F1505923_Ctrl_R1: found to_combine/F1505923_Ctrl_R1.info. Sample will be treated as being strand-specific.
[vast merge]: F1505927_ICM_R1: found to_combine/F1505927_ICM_R1.info. Sample will be treated as being strand-specific.
[vast merge]: F1505929_ICM_R1: found to_combine/F1505929_ICM_R1.info. Sample will be treated as being strand-specific.
[vast merge]: F1505930_ICM_R1: found to_combine/F1505930_ICM_R1.info. Sample will be treated as being strand-specific.
[vast merge]: F1505931_ICM_R1: found to_combine/F1505931_ICM_R1.info. Sample will be treated as being strand-specific.
[vast merge]: F1505933_ICM_R1: found to_combine/F1505933_ICM_R1.info. Sample will be treated as being strand-specific.
[vast merge]: Loading IR files (version 1)
[vast merge]:   Processing to_combine/F1505922_Ctrl_R1.IR
[vast merge]:   Processing to_combine/F1505923_Ctrl_R1.IR
[vast merge]:   Processing to_combine/F1505927_ICM_R1.IR
[vast merge]:   Processing to_combine/F1505929_ICM_R1.IR
[vast merge]:   Processing to_combine/F1505930_ICM_R1.IR
[vast merge]:   Processing to_combine/F1505931_ICM_R1.IR
[vast merge]:   Processing to_combine/F1505933_ICM_R1.IR
[vast merge]: Loading Microexon files
[vast merge]:   Processing to_combine/F1505922_Ctrl_R1.micX
[vast merge]:   Processing to_combine/F1505923_Ctrl_R1.micX
[vast merge]:   Processing to_combine/F1505927_ICM_R1.micX
[vast merge]:   Processing to_combine/F1505929_ICM_R1.micX
[vast merge]:   Processing to_combine/F1505930_ICM_R1.micX
[vast merge]:   Processing to_combine/F1505931_ICM_R1.micX
[vast merge]:   Processing to_combine/F1505933_ICM_R1.micX
[vast merge]: Loading eej2 files
[vast merge]:   Processing to_combine/F1505922_Ctrl_R1.eej2
Use of uninitialized value $p in array element at /mnt/nfs/lobo/MCFONSECA-NFS/pedro.barbosa/git_repos/vast-tools/bin/MergeOutputs.pl line 470, <I> line 26.
Use of uninitialized value $n in addition (+) at /mnt/nfs/lobo/MCFONSECA-NFS/pedro.barbosa/git_repos/vast-tools/bin/MergeOutputs.pl line 470, <I> line 26.
Use of uninitialized value $n in addition (+) at /mnt/nfs/lobo/MCFONSECA-NFS/pedro.barbosa/git_repos/vast-tools/bin/MergeOutputs.pl line 471, <I> line 26.
[vast merge error]: Sum of positions ne total provided for ENSG00000031823  53-41 in to_combine/F1505922_Ctrl_R1.eej2

Use of uninitialized value in concatenation (.) or string at /mnt/nfs/lobo/MCFONSECA-NFS/pedro.barbosa/git_repos/vast-tools/bin/MergeOutputs.pl line 463, <I> line 27.
Use of uninitialized value in addition (+) at /mnt/nfs/lobo/MCFONSECA-NFS/pedro.barbosa/git_repos/vast-tools/bin/MergeOutputs.pl line 464, <I> line 27.
Use of uninitialized value in split at /mnt/nfs/lobo/MCFONSECA-NFS/pedro.barbosa/git_repos/vast-tools/bin/MergeOutputs.pl line 467, <I> line 27.
Use of uninitialized value in string ne at /mnt/nfs/lobo/MCFONSECA-NFS/pedro.barbosa/git_repos/vast-tools/bin/MergeOutputs.pl line 473, <I> line 27.
[vast merge error]: Sum of positions ne total provided for :1,13:1,16:1,17:4,18:2,19:1,20:3,21:5,22:3,23:6,24:2,25:3,26:3,27:2,28:6,29:3,30:2,32:4,33:5  in to_combine/F1505922_Ctrl_R1.eej2

Use of uninitialized value $p in array element at /mnt/nfs/lobo/MCFONSECA-NFS/pedro.barbosa/git_repos/vast-tools/bin/MergeOutputs.pl line 470, <I> line 83.
Use of uninitialized value $n in addition (+) at /mnt/nfs/lobo/MCFONSECA-NFS/pedro.barbosa/git_repos/vast-tools/bin/MergeOutputs.pl line 470, <I> line 83.
Use of uninitialized value $n in addition (+) at /mnt/nfs/lobo/MCFONSECA-NFS/pedro.barbosa/git_repos/vast-tools/bin/MergeOutputs.pl line 471, <I> line 83.
[vast merge error]: Sum of positions ne total provided for ENSG00000174444  20-13 in to_combine/F1505922_Ctrl_R1.eej2

Use of uninitialized value in concatenation (.) or string at /mnt/nfs/lobo/MCFONSECA-NFS/pedro.barbosa/git_repos/vast-tools/bin/MergeOutputs.pl line 463, <I> line 84.
Use of uninitialized value in addition (+) at /mnt/nfs/lobo/MCFONSECA-NFS/pedro.barbosa/git_repos/vast-tools/bin/MergeOutputs.pl line 464, <I> line 84.
Use of uninitialized value in split at /mnt/nfs/lobo/MCFONSECA-NFS/pedro.barbosa/git_repos/vast-tools/bin/MergeOutputs.pl line 467, <I> line 84.
Use of uninitialized value in string ne at /mnt/nfs/lobo/MCFONSECA-NFS/pedro.barbosa/git_repos/vast-tools/bin/MergeOutputs.pl line 473, <I> line 84.
[vast merge error]: Sum of positions ne total provided for :1,13:34,14:83,15:18,16:93,17:45,18:49,19:156,20:64,21:39,22:9,23:3,25:6,26:6,27:1,28:4,29:1,30:2,32:2,33:1,34:5  in to_combine/F1505922_Ctrl_R1.eej2
  • you run it with just one core (for the test, you can use --onlyEX, which will be much faster)

I will try that.

  • if you want, you can send me one of the problematic FQ files and I try it here.

I can send you the two controls I was trying to merge. They are too big; any recommendation on how to share them, or would a subsample suffice?

mirimia commented 4 years ago

Sorry, I wasn’t clear: the new script is for align. Merge is actually working properly (i.e., it finds an unfinished line in the output of align and crashes).

So, the issue is related to the align step. Let’s see if these two things fix it.

PedroBarbosa commented 4 years ago

Ah ok. It's running, let's see how it goes, although I did not find the option --onlyEX

mirimia commented 4 years ago

Indeed, it’s --noIR ... sorry about that!

PedroBarbosa commented 4 years ago

Hi,

It looks like everything is fine now :) The align logs are as before, but now the merge step doesn't print anything wrong.

PS: I still used 10 cores.

mirimia commented 4 years ago

Good! Let me know if this happens with other samples again in the future.

(If you send me the eej2 from different runs, I can make sure they look consistent)
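One way to compare eej2 files from different align runs is to key each line by gene and junction and list the entries whose totals disagree. The following is an illustrative Python sketch only, assuming the same tab-separated layout (gene, junction, total reads, position string) inferred from the merge errors above. The helper names `read_totals` and `compare_runs` are hypothetical.

```python
def read_totals(lines):
    """Map (gene, junction) to the total-reads field, as text.

    Lines with fewer than three tab-separated fields (for example,
    fragments of a broken line) are skipped.
    """
    totals = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        gene, junction, total = fields[:3]
        totals[(gene, junction)] = total
    return totals


def compare_runs(lines_a, lines_b):
    """Return sorted (gene, junction) keys that differ between two runs.

    A key counts as different if its total changed or if it is
    present in only one of the runs.
    """
    a = read_totals(lines_a)
    b = read_totals(lines_b)
    return sorted(key for key in a.keys() | b.keys()
                  if a.get(key) != b.get(key))
```

An empty result for two runs of the same sample would suggest the outputs are consistent at the level of junction totals; it does not check the per-position strings.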