grenaud / gargammel

gargammel is an ancient DNA simulator
GNU General Public License v3.0

high coverage fail #4

Open calkan opened 5 years ago

calkan commented 5 years ago

Hi

I am trying to simulate aDNA data at high coverage. I assume the "-c" parameter sets the overall depth of coverage. Is this correct, or does it set the endogenous coverage? I do this:

./gargammel.pl -c 30 --comp 0.7,0.05,0.25 -l 110 -rl 100 -SS HS25 -o data/70-5-25-40x data/

after quite a long time gargammel fails:

.... Produced 2,147,400,000
ERROR: Cannot add thousandSeparator to non-integer 2147500000
system cmd /mnt/compgen/homes/calkan/projects/ancient/gargammel/src/adptSim -f AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATTCGATCTCGTATGCCGTCTTCTGCTTG -s AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTT -l 100 -artp data/70-5-25-40x_a.fa data/70-5-25-40x_d.fa.gz failed: 256 at ./gargammel.pl line 79.

grenaud commented 5 years ago

I upgraded isInt in my little library, libgab, to accommodate values up to uint64. Can you run:

cd libgab
git status
git pull origin master
make clean 
make 
cd ..
make clean 
make

I hope the fragment count will not overflow past 4 billion. And yes, -c is the endogenous coverage.
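
A side note on the numbers in the log: the last count printed, 2,147,400,000, sits just under 2^31 - 1 = 2,147,483,647, and the value that failed, 2147500000, sits just over it. That is consistent with the integer check having been limited to a signed 32-bit range, although the thread does not say so; treat it as an assumption. The Python sketch below only illustrates the ranges involved. `fits` is a hypothetical helper and is not part of libgab.

```python
# Illustrative integer limits (not taken from the libgab source).
INT32_MAX = 2**31 - 1    # 2,147,483,647
UINT32_MAX = 2**32 - 1   # 4,294,967,295, roughly the "4 billion" mentioned above
UINT64_MAX = 2**64 - 1


def fits(value, limit):
    """Return True if value is a non-negative integer no larger than limit."""
    return isinstance(value, int) and 0 <= value <= limit


last_reported = 2_147_400_000  # last progress count printed in the log
failed_value = 2_147_500_000   # value named in the error message

print(fits(last_reported, INT32_MAX))  # True
print(fits(failed_value, INT32_MAX))   # False: just past the signed 32-bit limit
print(fits(failed_value, UINT64_MAX))  # True: fits comfortably in 64 bits
```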

calkan commented 5 years ago

that problem is now gone, I think. I now have this error with ART though:

terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
system cmd /mnt/compgen/homes/calkan/projects/ancient/gargammel/art_src_MountRainier/art_illumina -ss HS25 -amp -na -p -i data/70-5-25-40x_a.fa -l 100 -c 1 -qs 0 -qs2 0 -o data/70-5-25-40x_s failed: 134 at ./gargammel.pl line 79.

calkan commented 5 years ago

OK, that is probably because the data/70-5-25-40x_s file is 917 GB for some reason. Am I doing this wrong?

./gargammel.pl -c 30 --comp 0.7,0.05,0.25 -l 110 -rl 100 -SS HS25 -o data/70-5-25-40x data/

What I want is a total of 30X human genome coverage with 100 bp paired-end reads (fragment length 110). That should translate to 900M reads (450M pairs) of length 100 bp. Of this data set, 70% should be bacterial, 25% endogenous, and 5% present-day contamination. That's what I'm trying to get anyway, but I guess I misinterpreted the -c parameter.
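
The arithmetic behind this request can be sketched in Python. The sketch assumes a human genome size of 3 Gbp (the round number implied by "30X = 900M reads of 100 bp") and assumes the composition fractions can be applied directly to coverage. That second assumption is a simplification and is not confirmed by the gargammel documentation. All function names are hypothetical.

```python
GENOME_SIZE = 3_000_000_000  # assumed human genome size in bp (round figure)
READ_LENGTH = 100            # read length in bp, from -rl 100


def reads_for_coverage(coverage, genome_size=GENOME_SIZE, read_length=READ_LENGTH):
    """Number of reads needed to reach the given depth of coverage."""
    return round(coverage * genome_size / read_length)


def total_coverage_implied(endogenous_coverage, endogenous_fraction):
    """Total human-genome-equivalent coverage if -c only counts endogenous data."""
    return endogenous_coverage / endogenous_fraction


def endogenous_coverage_for_total(total_coverage, endogenous_fraction):
    """Endogenous coverage that would yield the desired total coverage."""
    return total_coverage * endogenous_fraction


print(reads_for_coverage(30))                   # 900000000 reads, i.e. 450M pairs
print(total_coverage_implied(30, 0.25))         # 120.0: what -c 30 asks for at 25% endogenous
print(endogenous_coverage_for_total(30, 0.25))  # 7.5: endogenous share of a 30X total
```

Under these assumptions, -c 30 with 25% endogenous content requests about four times the intended amount of data, which would help explain the very large intermediate file. A value near 7.5 for -c would be closer to a 30X total.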

grenaud commented 5 years ago

The ART package cannot take zipped files. Hence we have to use plain files.

Can you do an ls -al in the directory data/70-5-25-40x data/

grenaud commented 5 years ago

Can you also try running ART on a subset? Do you still get the std::bad_alloc?

calkan commented 5 years ago

ls -l data/70-5-25-40x*
-rw-rw-r-- 1 calkan compgen 984014037789 Oct 21 22:50 data/70-5-25-40x_a.fa
-rw-rw-r-- 1 calkan compgen 91693121260 Oct 20 14:17 data/70-5-25-40x.b.fa.gz
-rw-rw-r-- 1 calkan compgen 7782171239 Oct 20 05:07 data/70-5-25-40x.c.fa.gz
-rw-rw-r-- 1 calkan compgen 177612870941 Oct 21 08:27 data/70-5-25-40x_d.fa.gz
-rw-rw-r-- 1 calkan compgen 78137570969 Oct 20 04:22 data/70-5-25-40x.e.fa.gz
-rw-rw-r-- 1 calkan compgen 0 Oct 21 22:50 data/70-5-25-40x_s1.fq

calkan commented 5 years ago

ART works well with a small subset, no std::bad_alloc.

grenaud commented 5 years ago

thanks! I have emailed the developers, let's wait. In the meantime, maybe you can dice up the input using unix split? Very sorry for the trouble.
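
One caveat with plain `split` is that a line-count boundary can fall in the middle of a multi-line FASTA record. A minimal Python sketch that cuts only at record boundaries is shown here. `split_fasta` and the `.partNNNN.fa` naming are hypothetical and not part of gargammel or ART. It assumes each fragment in the `_a.fa` file can be simulated independently of the others.

```python
from pathlib import Path


def split_fasta(fasta_path, records_per_chunk, out_prefix):
    """Split a plain-text FASTA file into chunks of at most records_per_chunk records.

    The file is streamed line by line, so memory use stays small even for
    very large inputs. Lines before the first '>' header are ignored.
    Returns the list of chunk paths in order.
    """
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")

    chunk_paths = []
    out = None
    records_in_chunk = 0
    with open(fasta_path) as src:
        for line in src:
            if line.startswith(">"):
                # Start a new chunk on the first record or when the current one is full.
                if out is None or records_in_chunk == records_per_chunk:
                    if out is not None:
                        out.close()
                    chunk_path = Path(f"{out_prefix}.part{len(chunk_paths):04d}.fa")
                    out = open(chunk_path, "w")
                    chunk_paths.append(chunk_path)
                    records_in_chunk = 0
                records_in_chunk += 1
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

Each chunk could then be passed to art_illumina with the same options as in the failing command, and the per-chunk FASTQ files concatenated per mate afterwards. This workflow was not tested in the thread.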

On Mon, Oct 22, 2018 at 8:46 PM Can Alkan notifications@github.com wrote:

art works well with a small subset, no std:bad_alloc


calkan commented 5 years ago

ok. there are _b, _c files as well, should I repeat with them? What happens after that, is the ART output the final output?

grenaud commented 5 years ago

Normally you just need the _a file. It is the one with the adapters ligated onto the deaminated fragments.

Yes, ART produces the final output.

On Mon, Oct 22, 2018 at 9:12 PM Can Alkan notifications@github.com wrote:

ok. there are _b, _c files as well, should I repeat with them? What happens after that, is the ART output the final output?
