vpc-ccg / pamir

Discovery and Genotyping of Novel Sequence Insertions in Many Sequenced Individuals
BSD 3-Clause "New" or "Revised" License

Error in rule sam_sort: #44

Closed christinafliege closed 4 years ago

christinafliege commented 4 years ago

Good Afternoon,

Pamir has started running on this project, and I get 5% through before it aborts with the following error. I have run that shell command separately and receive the same error. The file that is created by the previous step does not appear to be actually truncated. Thank you!


samtools sort: truncated file. Aborting
[Fri May 29 20:37:22 2020]
Error in rule sam_sort:
    jobid: 105
    output: /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamiranalysis/population/003-pamir-oea-processing/003-HLH-004_all_lanes_merged/003-HLH-004_all_lanes_merged.anchor.sorted.sam
    shell:
        samtools sort  /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamiranalysis/population/003-pamir-oea-processing/003-HLH-004_all_lanes_merged/003-HLH-004_all_lanes_merged.anchor.sam -m 8G -@ 1 -O SAM -o /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamiranalysis/population/003-pamir-oea-processing/003-HLH-004_all_lanes_merged/003-HLH-004_all_lanes_merged.anchor.sorted.sam
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)   
f0t1h commented 4 years ago

Can you post the config file for the run?

christinafliege commented 4 years ago

The config file is the same as the one you helped us with before in the input field, however the data set changed. The example data I was working with was too small and failing at some other steps


  /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamir
raw-data:
  /inputs
reference:
  /projects/bioinformatics/DataPacks/human/gatk_bundle_Oct_2017/gatk_bundle_hg38/Homo_sapiens_assembly38.fasta
population:
  population
input:
  "003-HLH-001_all_lanes_merged":
    - 003-HLH-001_all_lanes_merged.sorted.realigned.bam
  "003-HLH-003_all_lanes_merged":
    - 003-HLH-003_all_lanes_merged.sorted.realigned.bam
  "003-HLH-004_all_lanes_merged":
    - 003-HLH-004_all_lanes_merged.sorted.realigned.bam
centromeres:
  /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamir/inputs/centro.meres
analysis-base:
  analysis
f0t1h commented 4 years ago

Can you try the small example given in the README and tell us if it finishes successfully?

curl -L https://ndownloader.figshare.com/files/22813988 --output example.tar.gz
tar xzvf example.tar.gz
cd example
chmod +x configure.sh
./configure.sh
pamir.sh -j16 --configfile config.yaml

This will help us determine root cause of the problem.

Thanks

christinafliege commented 4 years ago

Thank you.

When running the data set as shown above I get the following error.


Error in rule minia_all:
    jobid: 87
    output: /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamir/inputs/example/analysis/small-pop/002-minia/contigs.fasta
    shell:
        cd /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamir/inputs/example/analysis/small-pop/002-minia/ && minia  -verbose 0 -in /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamir/inputs/example/analysis/small-pop/002-minia/reads.fofn -kmer-size 64 -abundance-min 5 -max-memory 250000 -nb-cores 16 && mv reads.contigs.fa contigs.fasta
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /usr/local/apps/bioapps/pamir/pamir-2.1.0/pamir/.snakemake/log/2020-06-02T095928.686689.snakemake.log
f0t1h commented 4 years ago

It is a bug in how the maximum memory is set for minia. @joshfactorial can you switch to the master branch, do a clean installation of pamir, and run the small example again?

joshfactorial commented 4 years ago

I got this message after installing:

$ pamir.sh
Running Pamir (f49e0b8)
snakemake(5.17.0)... Failed, requiring == 5.9.1
f0t1h commented 4 years ago

We check the version requirements against our conda environment.yaml. Some of these requirements are strict because pamir's intended behaviour depends on them. Snakemake, however, should also work with later versions. As a temporary workaround, can you change the snakemake line in environment.yaml to use ">=", then run make clean and make again? It will then pass that check.
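
A minimal sketch of that temporary edit, assuming the pin in environment.yaml is written as `- snakemake=5.9.1` or `- snakemake==5.9.1` (the exact line is not shown in this thread). The function name `relax_snakemake_pin` is made up for illustration and is not part of pamir:

```shell
# Hypothetical helper: rewrite an exact snakemake pin ("=" or "==") to ">=".
# Reads YAML on stdin and writes the edited YAML to stdout.
# Lines for other packages pass through unchanged.
relax_snakemake_pin() {
    sed -E 's/^([[:space:]]*-[[:space:]]*snakemake)[[:space:]]*=+[[:space:]]*/\1>=/'
}

# Example (assumed to be run from the pamir checkout):
#   relax_snakemake_pin < environment.yaml > environment.yaml.new
#   mv environment.yaml.new environment.yaml
#   make clean && make
```

Writing to a separate file first keeps the original environment.yaml intact until the result has been checked.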

If you are interested, we can set up a Zoom call to help you set up pamir.

fhach commented 4 years ago

@christinafliege and @joshfactorial, Any updates from your end?

joshfactorial commented 4 years ago

The fix above worked. We're just running into trouble at the moment installing RepeatMasker on our cluster. I'm trying to work through some of the perl issues. Once we get that up and running, I can let you know if we hit any other Pamir issues.

joshfactorial commented 4 years ago

Okay, I think RepeatMasker installed correctly, however, when I run pamir, this is the result:

$ pamir.sh
Running Pamir (f49e0b8)
snakemake(5.17.0)... OK
samtools(1.9)... OK
bedtools(2.29.2)... OK
mrsfast(3.4.1)... OK
bwa(0.7.17)... OK
repeatmasker(4.1.0)... OK, Version not checked.

where it simply hangs.

f0t1h commented 4 years ago

Did you provide the cluster config when running pamir, as in the following?

pamir.sh --configfile [Config-Path] -j [Number-Of-Cores]
joshfactorial commented 4 years ago

Okay, so we've run into a number of problems running it with a config.

  1. Snakemake tries to create the .snakemake folder for logs in the installation folder, which means multiple users can't run it at the same time. Is there a way to specify the folder in which those logs are created?
  2. Pamir seems to get stuck after bwa when we run it.
  3. When we run it on a head node, there are no errors (but it doesn't complete successfully), but on a compute node we get this:
    Processing partitions between 1 and 286 with 15 threads
    terminate called after throwing an instance of 'std::bad_alloc'
    terminate called after throwing an instance of 'terminate called after throwing an instance of '  what():  terminate called after throwing an instance of 'terminate called after throwing an instance of 'std::bad_allocterminate called after throwing an instance of 'std::bad_allocterminate called after throwing an instance of 'std::bad_allocstd::bad_allocstd::bad_alloc'
    '
    terminate called after throwing an instance of 'std::bad_alloc'
    '
    '
    std::bad_alloc  what():    what():  std::bad_alloc'
    terminate called after throwing an instance of '  what():    what():  std::bad_alloc  what():  std::bad_alloc'
    std::bad_allocstd::bad_allocstd::bad_alloc  what():  std::bad_alloc
joshfactorial commented 4 years ago

Attached is the full log LudasLog.txt

joshfactorial commented 4 years ago

A quick update. The bwa issue (2) turned out to be a problem on our end.

f0t1h commented 4 years ago

Does bad_alloc error (3) still persist?

joshfactorial commented 4 years ago

As far as that goes, we think we have a way to run it, but we need to do a little more testing. Basically, we have to copy out pamir.sh to the run folder, delete the -d option that tries to create the log within the installation directory, and run it there. @christinafliege is going to try to do that today (or soon anyway) at some point, with the correct configuration files and hopefully get it working. We did have a successful completion using this method over the weekend on some test data.

christinafliege commented 4 years ago

After copying pamir.sh out to a run folder and editing it to remove the -d option, it ran successfully on the example data for multiple users!

However, when running on our input data we are receiving the same error as listed above. Although this time the error log says that it is 13% done instead of 5%. The SAM file that it is creating does not look like it is truncated. Elsewhere in the error log it states that "Big Queue is not cleared", although we are uncertain if that is necessary. The config file for generating this error is the same as the original config file above. I have restarted the job and it picked up where it left off but generated the same error.

Thank you!


[W::sam_read1] Parse error at line 25369458
samtools sort: truncated file. Aborting
[Tue Jun  9 09:15:11 2020]
Error in rule sam_sort:
    jobid: 93
    output: /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamiranalysis/population/005-pamir-oea-processing/003-HLH-004_all_lanes_merged/003-HLH-004_all_lanes_merged.anchor.sorted.sam
    shell:
        samtools sort  /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamiranalysis/population/005-pamir-oea-processing/003-HLH-004_all_lanes_merged/003-HLH-004_all_lanes_merged.anchor.sam -m 8G -@ 1 -O SAM -o /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamiranalysis/population/005-pamir-oea-processing/003-HLH-004_all_lanes_merged/003-HLH-004_all_lanes_merged.anchor.sorted.sam
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
christinafliege commented 4 years ago

Additionally, we are using -j16 for our cluster configuration in the qsub script and running on a queue with 2 * Intel Xeon E5 2690 v3 CPUs (24 cores/node) and 256 GB of RAM.

Thanks!

f0t1h commented 4 years ago

Can you run the following command so we can see what is wrong with that line? Thank you.

cat /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamiranalysis/population/005-pamir-oea-processing/003-HLH-004_all_lanes_merged/003-HLH-004_all_lanes_merged.anchor.sam | head -n 25369458 | tail -n1
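
A possible alternative to the cat | head | tail pipeline above, sketched under the assumption that a POSIX-style sed and awk are available. The function names `print_line` and `count_fields` are illustrative only and are not part of pamir or samtools:

```shell
# Print line N of a file and stop reading right after it.
print_line() {
    # $1 = file, $2 = 1-based line number
    sed -n "${2}{p;q;}" "$1"
}

# Print the number of TAB-separated fields on line N.
# A SAM alignment line has 11 mandatory fields (optional tags add more),
# and field 4 (POS) must be an integer.
count_fields() {
    # $1 = file, $2 = 1-based line number
    awk -F '\t' -v n="$2" 'NR == n { print NF; exit }' "$1"
}

# Example (path shortened here; use the full path to the .anchor.sam file):
#   print_line   003-HLH-004_all_lanes_merged.anchor.sam 25369458
#   count_fields 003-HLH-004_all_lanes_merged.anchor.sam 25369458
```

Both helpers stop at the requested line, which avoids reading the rest of a large SAM file.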
christinafliege commented 4 years ago

Here is the output. Thanks!


R0230412_0160:8:1312:18333:50748#0/1    16      HLA-B*27:24     HLA01504        796     255     100M    *       0      0CTACGATGGCAAGGATTACATCGCCCTGAACGAGGACCTGCACTCCTGGACCGCCGCGAACACAGCGGCTCAGATCTCCCAGCACAAGTGGGAAGCGGAC    ?22BB@4>:>4((DDA@8@BB<<8+(??82@AA@A8BBAB<3???>>9;3;/8A4;(BD?D?6EFFFECGG>GFIGAFIEEHC:@GBFBDFF@:@A?8:=    NM:i:7  MD:Z:6C34G16G4G19G9G4C1
fhach commented 4 years ago

@christinafliege have you installed mrsfast through bioconda or directly from github?

christinafliege commented 4 years ago

@joshfactorial can you answer this?

joshfactorial commented 4 years ago

We installed it directly from github.

fhach commented 4 years ago

@joshfactorial please update mrsfast to the latest version, v3.4.2, from github. This will fix the samtools truncation error. You also need to rebuild the mrsfast index after updating mrsfast. Update the mrsfast version in environment.yaml to 3.4.2 as well, so the script passes the version check. As always, please run make clean && make.

fhach commented 4 years ago

@joshfactorial Most of these changes (multiple user snakemake; version updates) are reflected in the current master. I would encourage you to do a full clone.

fhach commented 4 years ago

@christinafliege after @joshfactorial makes the updates, you need to remove either /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamiranalysis/population/005-pamir-oea-processing/ or the full analysis folder. Since the index will change, I lean towards suggesting that you remove the full analysis folder.

fhach commented 4 years ago

@christinafliege If everything goes smoothly, we can close this thread.

christinafliege commented 4 years ago

@fhach, after @joshfactorial did the reinstall, I deleted the entire PamirAnalysis folder and started the job up again. It ran for 15 hours before erroring with the same message.


samtools sort: truncated file. Aborting
[Fri Jun 12 03:31:19 2020]
Error in rule sam_sort:
    jobid: 93
    output: /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamiranalysis/population/005-pamir-oea-processing/003-HLH-004_all_lanes_merged/003-HLH-004_all_lanes_merged.anchor.sorted.sam
    shell:
        samtools sort  /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamiranalysis/population/005-pamir-oea-processing/003-HLH-004_all_lanes_merged/003-HLH-004_all_lanes_merged.anchor.sam -m 8G -@ 1 -O SAM -o /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamiranalysis/population/005-pamir-oea-processing/003-HLH-004_all_lanes_merged/003-HLH-004_all_lanes_merged.anchor.sorted.sam
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
fhach commented 4 years ago

My initial guess would be that you have not rebuilt the mrsfast index after obtaining v3.4.2.

Regardless, that truncation error should come with a line number. Can you use @f0t1h's command to extract that line from the SAM file?
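
As a rough sketch, the line number can be pulled out of a saved log like this, assuming the warning has the same `Parse error at line N` form seen earlier in this thread. The function name `parse_error_lines` and the log file name in the example are placeholders:

```shell
# Print the line numbers from samtools "Parse error at line N" warnings.
parse_error_lines() {
    # $1 = log file captured from the failed run
    sed -n 's/.*Parse error at line \([0-9][0-9]*\).*/\1/p' "$1"
}

# Example (placeholder log name):
#   parse_error_lines pamir_run.log
```

The number it prints is the value to pass to head -n in the extraction command.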

christinafliege commented 4 years ago

I'm sorry, somehow that didn't get copied out of the error file:

[W::sam_read1] Parse error at line 25369466

joshfactorial commented 4 years ago

Okay, just so we're clear we need to run this command:

./mrsfast --index genome.fa

?

christinafliege commented 4 years ago

cat /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamiranalysis/population/005-pamir-oea-processing/003-HLH-004_all_lanes_merged/003-HLH-004_all_lanes_merged.anchor.sam | head -n 25369458 | tail -n1

R0230412_0160:5:2110:4957:26090#0/1     16      chrUn_JTFH01001998v1_decoy      955     255     100M    *       0      0                                                                                                                    TATAATACATGCTTTGGGTACTTTGATATTTTTTGTACAGTATAGAATATATACCTTGGGTACTTTGATATTTTATGTACAGTATATAATATATAGTTTG     EEFEFECFFFFFHHHHHIHEJJJIJJJGGJJJJJJJIJJJJJJJJJIJIIHIGJJIIJIHGHIIJJIIJJJJIIIJICJJJJJJJJJHHHHHFFFFFCCC                                NM:i:7   MD:Z:10A21A4G11C4T36C3C4
fhach commented 4 years ago

@christinafliege in head -n, replace the number with 25369466.

fhach commented 4 years ago

@joshfactorial, yes, that is correct.

christinafliege commented 4 years ago

 cat /projects/mgc/Project_2/HLHS_BasilAniseVC/Pamiranalysis/population/005-pamir-oea-processing/003-HLH-004_all_lanes_merged/003-HLH-004_all_lanes_merged.anchor.sam | head -n 25369466 | tail -n1
R0230412_0160:8:1312:18333:50748#0/1    16      HLA-B*27:24     HLA01504        796     255     100M    *       0      0                                                                                                                    CTACGATGGCAAGGATTACATCGCCCTGAACGAGGACCTGCACTCCTGGACCGCCGCGAACACAGCGGCTCAGATCTCCCAGCACAAGTGGGAAGCGGAC     ?22BB@4>:>4((DDA@8@BB<<8+(??82@AA@A8BBAB<3???>>9;3;/8A4;(BD?D?6EFFFECGG>GFIGAFIEEHC:@GBFBDFF@:@A?8:=                                NM:i:7   MD:Z:6C34G16G4G19G9G4C1
fhach commented 4 years ago

@christinafliege the index should be rebuilt. The issue comes from comments in the reference FASTA headers that are separated from the sequence name by a TAB. mrsfast's new index should handle comments separated by TAB.
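
In the extracted SAM line, the reference name HLA-B*27:24 is followed by an extra column (HLA01504) before the position, which looks consistent with a TAB inside the FASTA header. A small sketch for listing such headers in a reference, assuming a plain-text (uncompressed) FASTA; the function name `tab_headers` is made up for illustration:

```shell
# Print FASTA header lines (starting with ">") that contain a TAB character.
tab_headers() {
    # $1 = reference FASTA file
    awk '/^>/ && index($0, "\t") > 0' "$1"
}

# Example (reference path from the config posted above):
#   tab_headers Homo_sapiens_assembly38.fasta | head
```

If this prints nothing, the reference headers contain no TABs and the cause is probably elsewhere.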

When @joshfactorial has finished indexing, please delete the folders starting with 005, 006, ... 012. Keep only the folders starting with 001, 002, 003, and 004, and then you can rerun. It will resume from stage 005.
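
A cautious way to do that cleanup is to list the candidate folders first and delete them only after checking the list. This sketch assumes the stage folders follow the `NNN-name` pattern seen in the paths above (for example 005-pamir-oea-processing) and that `find` supports -mindepth and -maxdepth; the function name `list_stale_stages` is hypothetical:

```shell
# List stage directories numbered 005 through 012 directly under a folder.
# This only prints candidates; nothing is deleted.
list_stale_stages() {
    # $1 = directory that contains the NNN-* stage folders
    find "$1" -mindepth 1 -maxdepth 1 -type d \
        \( -name '00[5-9]-*' -o -name '01[0-2]-*' \) | sort
}

# Example (path shortened; use the real population folder):
#   list_stale_stages Pamiranalysis/population
```

Once the printed list looks right, the listed folders can be removed by hand with rm -r.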

joshfactorial commented 4 years ago

We're re-running now.

christinafliege commented 4 years ago

While still running, it looks like this fixed the current error. Thanks!

fhach commented 4 years ago

@christinafliege if you have a complete run on your data, would you please close this issue.

christinafliege commented 4 years ago

The data is still running. It has been at "rule pamir_assemble_full_new:" for the past 46 hours! I will close the issue when it is complete! Thanks! :)