mkirsche / Jasmine

Jasmine: SV Merging Across Samples
MIT License

changing number of threads from 2 to 4 or 8 #18

Closed santoshatanur closed 3 years ago

santoshatanur commented 3 years ago

Hi, I am trying to change the number of threads to 4 or 8 using the threads=4 or threads=8 option, but Jasmine still uses only the default 2 threads; the threads option does not override the default. My machine has 16GB of memory per processor, so to get more memory I need to request more threads. But even when I request 4 or 8 threads, with 64GB or 128GB of memory accordingly, Jasmine fails as soon as it reaches 32GB of memory. It never uses the additional memory even when it is available. Could you please help me with the correct way to use the threads option, or explain how to increase memory usage?

mkirsche commented 3 years ago

Hi,

Jasmine's multi-threading is only used for speeding up the merging process, and regardless of the number of threads you are using, the entire set of variants still needs to be able to fit into memory on a single thread. One workaround you could try is splitting up your VCFs by chromosome and then starting separate instances of Jasmine on different processors for each set of variants. As for why you're not seeing more than two threads used, it's possible that Jasmine is crashing while reading the input, before it gets to the multi-threaded merging part of its execution.
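As a rough illustration of the per-chromosome split (this is not part of Jasmine; the function names are hypothetical and the input is assumed to be an uncompressed, tab-delimited VCF), a minimal Python sketch could look like this:

```python
from collections import defaultdict


def split_vcf_lines_by_chrom(lines):
    """Group VCF records by CHROM, copying the header lines into every group."""
    header = []
    records = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            header.append(line)          # meta-information and column header
        elif line.strip():
            chrom = line.split("\t", 1)[0]
            records[chrom].append(line)  # keep the original record order
    return {chrom: header + recs for chrom, recs in records.items()}


def write_split_vcfs(vcf_path, out_prefix):
    """Write one VCF per chromosome; returns {chrom: output path}."""
    with open(vcf_path) as handle:
        groups = split_vcf_lines_by_chrom(handle)
    paths = {}
    for chrom, lines in groups.items():
        path = f"{out_prefix}.{chrom}.vcf"
        with open(path, "w") as out:
            out.writelines(lines)
        paths[chrom] = path
    return paths
```

Each per-chromosome file can then go into its own file list and be merged by a separate Jasmine instance. Note that this sketch holds one whole input VCF in memory at a time, which is fine for per-sample files but may not be for very large merged ones.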

I hope that answers your questions! Melanie

santoshatanur commented 3 years ago

Hi Melanie

Thanks for your quick response. I was planning to use the same approach of splitting the files by chromosome and merging each chromosome on a separate processor; however, I have more than 22,000 samples, so it would be very messy. Also, if the files are separated by chromosome, how will translocations be handled?

I tried merging 1000 files at a time and then merging the 22 resulting files. Merging 1000 files works fine, but merging the 22 merged files fails with an out-of-memory error.

Regards, Santosh


mkirsche commented 3 years ago

Hi Santosh,

Thanks for the more detailed information! It's good to know that the 1000-sample merges are working okay, but since memory usage is linear in the total size of all input VCFs, I would expect that final merge to be similarly memory-intensive to merging all 22,000 at once. Now that there are fewer files, though, it is hopefully more manageable to separate them by chromosome before calling Jasmine. That's a good point about translocations though; there are cases where the same translocation may be listed as starting on chr1 and ending on chr2 in one sample, but be listed the opposite way in another sample. Fortunately, translocation calls are typically rarer than other SV types, so it should be sufficient to process all translocation calls together when calling Jasmine, and then separate the rest of the SVs by chromosome.
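A minimal sketch of that routing rule in Python, assuming translocations are tagged SVTYPE=TRA or SVTYPE=BND in the INFO column (this depends on the SV caller, so check your own files); the function names and the "translocations" bucket label are hypothetical:

```python
# Assumption: these SVTYPE values mark translocation-like calls.
TRANSLOCATION_TYPES = {"TRA", "BND"}


def info_value(info, key):
    """Return the value of KEY=value in a VCF INFO string, or None if absent."""
    for item in info.split(";"):
        name, _, value = item.partition("=")
        if name == key:
            return value
    return None


def bucket_for_record(line):
    """Pick the split file a VCF record belongs to.

    All translocations share one bucket, so the same event reported as
    chr1->chr2 in one sample and chr2->chr1 in another is still merged
    together; every other SV goes into the bucket of its own chromosome.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, info = fields[0], fields[7]
    if info_value(info, "SVTYPE") in TRANSLOCATION_TYPES:
        return "translocations"
    return chrom
```

Records would then be written to one file per bucket, and Jasmine called once per bucket.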

Best, Melanie

santoshatanur commented 3 years ago

Hi Melanie

I tried splitting the files by chromosome and then merging. It didn't work either. I am using the --output_genotypes option; is that causing the memory issues? I also observed that the output file doesn't provide information about allele number (AN) and allele frequency (AF). Is there any option that allows writing AF and AN in the INFO field?

Regards Santosh


mkirsche commented 3 years ago

Hi Santosh,

No, the addition of genotypes uses less memory than the main merging step, so shouldn't be causing the problem.

As for allele frequency and allele number, the information is there, but it's stored a little bit differently to give information about which samples each variant is present in and to preserve information across multiple rounds of merging. The SUPP_VEC INFO field is a binary string with a 1 or 0 for each sample, corresponding to the variant being present or absent in it. SUPP_VEC_EXT is similar, but preserves information across multiple rounds of merging, so if you merged batches of 1000 samples and then merged those 22 together, the SUPP_VEC field in the final output would have just 22 "samples" while SUPP_VEC_EXT would have 22,000. This SUPP_VEC_EXT field can be disabled with --ignore_merged_inputs to keep the size of the output file lower if a length-22000 INFO field in each variant is prohibitively large, but disabling it shouldn't affect the memory usage of running Jasmine at all. The SUPP and SUPP_EXT INFO fields also give the number of samples the variant is present in.
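For example, a support vector can be decoded with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the sample names are assumed to be supplied in the same order as the input file list.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; flag-style keys map to True."""
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def supporting_samples(info, sample_names, key="SUPP_VEC"):
    """Return the names of the samples whose bit is 1 in the support vector."""
    vec = parse_info(info)[key]
    if len(vec) != len(sample_names):
        raise ValueError("support vector length does not match sample count")
    return [name for name, bit in zip(sample_names, vec) if bit == "1"]


info = "SVTYPE=DEL;SUPP_VEC=1010;SUPP=2"
print(supporting_samples(info, ["s1", "s2", "s3", "s4"]))
```

The number of names returned should agree with the SUPP value on the same record; passing key="SUPP_VEC_EXT" decodes the extended vector in the same way.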

Melanie

santoshatanur commented 3 years ago

Thanks for the explanation about AF and AC.

But the main problem of merging 22,000 samples still remains unsolved, even after dividing the files by chromosome.

Regards Santosh


mkirsche commented 3 years ago

Yes, that's true. Were any of the chromosomes able to be merged correctly? If so, and if it's only the larger chromosomes, here are a few more ways you could try to merge variants on the remaining chromosome with limited memory:

mkirsche commented 3 years ago

Hi Santosh,

To follow up on my last message, I have just added a utility script split_jasmine (https://github.com/mkirsche/Jasmine/commit/7fb1bb256d78be9baeba419939d4235cc55ceb9c) to help automate some of the suggestions above. It will split each VCF by chromosome and SV type, and has the option to also split by genomic position along each chromosome at fixed intervals. It then writes to standard out a list of split VCF lists which can be passed to Jasmine as follows:

./split_jasmine file_list=filelist.txt output_dir=split segment_length=50m > all.txt
# Merge each split file list; run these iterations in parallel if possible
while read -r i; do jasmine file_list="$i" out_file="$i.merged.vcf"; done < all.txt

I expect to include this script in the next release so it will be available in the conda Jasmine installation at that time, but in the meantime if you build Jasmine from the Github source you can run it with the command /path/to/jasmine_repo/split_jasmine and see the usage instructions by running it with no parameters. As I mentioned before, the merging will be disrupted at any breakpoints that are introduced based on the optional segment_length parameter, so you will want to make the segment lengths as long as possible while still being able to fit in the memory that you have available. I would also recommend doing this splitting on the 22 merged files you have instead of all 22,000 to reduce the number of intermediate files produced.
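If a job scheduler is not available, one way to run that loop in parallel is a small Python wrapper. This is a sketch under assumptions: jasmine is on the PATH, all.txt has one file-list path per line, and the helper names are hypothetical. Only the file_list= and out_file= arguments shown above are used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_list_paths(all_txt):
    """Read the file-list paths (one per line) written by split_jasmine."""
    with open(all_txt) as handle:
        return [line.strip() for line in handle if line.strip()]


def build_commands(list_paths, jasmine="jasmine"):
    """Build one Jasmine command per split file list."""
    return [
        [jasmine, f"file_list={path}", f"out_file={path}.merged.vcf"]
        for path in list_paths
    ]


def run_all(commands, max_workers=4):
    """Run the commands concurrently and return their exit codes in order."""
    def run(command):
        return subprocess.run(command).returncode

    # Threads are enough here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))
```

A call such as run_all(build_commands(read_list_paths("all.txt")), max_workers=4) would then run four merges at a time. Since every instance needs its own memory, max_workers should be chosen from the memory available rather than the number of cores.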

I hope that helps! Melanie

santoshatanur commented 3 years ago

Hi Melanie

Thanks for sending this. However, I would like to point out a few things:

1) All the chromosomes failed to merge, irrespective of chromosome length. Even for chrUn (the combination of all unmapped contigs), which is quite small, Jasmine did not work.
2) I tried to merge the 22 VCF files (each containing 1000 samples and all chromosomes) without the genotype option, and it did work. So the memory issue is not due to the number of variants but to the number of samples/genotypes.
3) SUPP_VEC_EXT or SUPP_VEC cannot be used to derive allele frequencies (AF), because they only record the presence (1) or absence (0) of a variant, not whether it is heterozygous or homozygous alternate. AF is the number of alternate alleles divided by the total number of alleles. For a homozygous variant the allele count should be 2; for a het it's 1. The total number of alleles is the number of samples x 2, as each individual has two copies of every chromosome (except for the sex chromosomes in males).
4) So, to calculate AF, genotype information is essential.
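To make the arithmetic in point 3 concrete, here is a minimal Python sketch of computing allele count (AC), allele number (AN) and AF from per-sample GT strings. The function name is hypothetical; missing alleles (".") are skipped, and any non-reference allele is counted as alternate, which is a simplification for multi-allelic records.

```python
import re


def allele_counts(genotypes):
    """Return (AC, AN, AF) for a list of GT strings such as '0/1' or '1|1'."""
    ac = 0  # number of alternate alleles observed
    an = 0  # number of called alleles
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue  # missing call: contributes to neither AC nor AN
            an += 1
            if allele != "0":
                ac += 1
    af = ac / an if an else None
    return ac, an, af


# het (1) + hom alt (2) + hom ref (0) + missing (skipped): AC=3, AN=6
print(allele_counts(["0/1", "1/1", "0/0", "./."]))
```

A haploid genotype such as "1" contributes a single allele, which covers the sex chromosomes in males. The same result cannot be recovered from a 0/1 support vector, because a het and a hom alt both appear as 1.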

Regards, Santosh


mkirsche commented 3 years ago

Hi Santosh,

It's good to know that it's failing only with the genotypes option, and regardless of the number of variants. Do you have the error message from when it failed? It would be helpful in tracking down the reason for it running out of memory.

Thanks and sorry for all the trouble, Melanie

santoshatanur commented 3 years ago

Hi Melanie

Here is the error from one of the runs.

Santosh


mkirsche commented 3 years ago

Hi Santosh,

I'm not seeing any error message in or attached to your reply - did Github maybe reject a file upload?

Thanks, Melanie

santoshatanur commented 3 years ago

Screenshot 2021-06-30 at 14 20 30

mkirsche commented 3 years ago

Hi Santosh,

Thanks for providing that! I think I have identified the issue - one of the GT fields (the previous merge's SUPP_VEC field, which is now unnecessary due to the more recent addition of the SUPP_VEC_EXT INFO field) was causing the memory to scale poorly with the number of samples. I have addressed this plus a few smaller issues in the most recent commit (https://github.com/mkirsche/Jasmine/commit/0120f7a18a7b47e70118e17bd76a4a1d2137b534). Could you please run this version on your data and let me know if it fixes your issue?

Thanks, Melanie

santoshatanur commented 3 years ago

I downloaded the most recent commit as you suggested but still got the following error. Did I download the correct commit?

Screenshot 2021-07-01 at 19 07 17

mkirsche commented 3 years ago

It looks like that's using the old version based on the line numbers (line 322 of AddGenotypes.java no longer contains the call to VcfEntry.getReadSupport). The updated version v1.1.2 is currently under review with bioconda though (https://github.com/bioconda/bioconda-recipes/pull/29366), so should be available through there later today.

santoshatanur commented 3 years ago

I have downloaded Jasmine version 1.1.2 and rerun the merging of the 22 VCF files. I am still getting an OutOfMemory error. I would like to mention that I can see the merged SV file without genotypes in the temp folder where intermediate files are written. So the code runs fine up to merging the SVs; it runs out of memory only when it tries to write genotypes. This is my last try, I am giving up now...

Screenshot 2021-07-02 at 08 41 11

mkirsche commented 3 years ago

Hi Santosh,

I'm sorry to hear you still got the out of memory error, and for all of the frustration.

It looks like that error occurred before the addition of genotypes even began (in the middle of combining VCF entries and writing the ungenotyped file), and so would be unaffected by the changes I made. Given the error you got, I would expect that the merged no-genotype file is from the previous run since the place this latest run crashed is before the place where the no-genotype file is moved to the output directory.

But since you have that file, it is possible to run the genotype output in isolation, which should be less memory-intensive than rerunning the entire merging:

# First move the no-genotypes file outside of the output directory
jasmine file_list=/path/to/listof22vcfs.txt out_file=/path/to/ungenotyped.vcf --postprocess_only --output_genotypes 

That being said, I do understand if you'd rather not keep trying to get it to work, since running into these kinds of issues is always frustrating. I'm happy to help address any problems you might encounter if you do give it another try though.

Melanie

santoshatanur commented 3 years ago

I did as you suggested but still got the same error.

Screenshot 2021-07-02 at 20 34 10

santoshatanur commented 3 years ago

Thanks for all your help. It works when I split the files into smaller chunks: chromosomes that are 100mb or smaller work as a whole chromosome, while larger chromosomes need to be split into 100mb chunks to merge the 22 VCFs (1000 samples per VCF). Thanks once again.
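For anyone reproducing this, the chunk boundaries can be sketched in Python as below. This is only an illustration of fixed 100mb windows with 1-based inclusive coordinates; the function name is hypothetical, and how split_jasmine itself places its boundaries is not shown here.

```python
def segments(chrom_length, segment_length=100_000_000):
    """Return 1-based inclusive (start, end) windows covering a chromosome.

    Chromosomes no longer than segment_length come back as a single window,
    i.e. they are merged whole.
    """
    return [
        (start, min(start + segment_length - 1, chrom_length))
        for start in range(1, chrom_length + 1, segment_length)
    ]


# A 250 Mb chromosome needs three chunks; an 80 Mb one stays whole.
print(segments(250_000_000))
print(segments(80_000_000))
```

SVs close to a window boundary may not be merged with matching calls on the other side, so longer windows are preferable whenever they fit in memory.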

mkirsche commented 3 years ago

Awesome! I'm glad to hear it finally worked for you!

Melanie