mkirsche / Jasmine

Jasmine: SV Merging Across Samples
MIT License

Run iris reports error (invalid read names field) #12

Closed samll-rookie closed 3 years ago

samll-rookie commented 3 years ago

I want to understand what caused the problem. The error is as follows:

Skipping Bomo_Chr1:59811:INS:4 because of invalid read names field: (time = 00:00:00:04.165)
Skipping Bomo_Chr1:71777:INS:5 because of invalid read names field: (time = 00:00:00:04.173)

The command is:

jasmine --output_genotypes file_list=head5.vcf.list out_file=head5.vcf genome_file=genome_assembly.fa --dup_to_ins samtools_path=/export2/software/Bases/samtools/v1.4/bin/samtools --run_iris bam_list=head5.bam.list out_dir=test02_iris > log 2>&1

mkirsche commented 3 years ago

Hi,

Running Iris requires the RNAMES field to be present (it then uses the reads listed there to perform variant polishing). Could you please let me know how the VCF is being generated? If running Sniffles the parameter -n -1 is needed (see here: https://github.com/mkirsche/Iris).

Best, Melanie


samll-rookie commented 3 years ago

Thank you for your prompt reply. My running command is:

sniffles-core-1.0.12/sniffles --mapped_reads A1.bam --vcf A1.vcf --min_support 2 --threads 12

When I run Iris later, does Sniffles need any additional parameters?

mkirsche commented 3 years ago

Hi,

Other than the missing "-n -1" parameter I mentioned, which outputs the RNAMES field, everything else looks great!
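Putting the two commands in the thread together, a corrected Sniffles invocation would look like this (a sketch only: paths and file names are taken from the command posted above, and the sole change is the added "-n -1", which makes Sniffles 1.0.x report all supporting read names in the RNAMES INFO field that Iris requires):

```shell
# Hypothetical corrected invocation of sniffles-core-1.0.12; identical to
# the earlier command except for the trailing "-n -1".
CMD='sniffles-core-1.0.12/sniffles --mapped_reads A1.bam --vcf A1.vcf --min_support 2 --threads 12 -n -1'
echo "$CMD"   # inspect, then run with: eval "$CMD"
```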

Best, Melanie

samll-rookie commented 3 years ago

Thanks for your suggestion. I have another problem: my compute node has 500G of memory, and an error shows that the memory is insufficient. Is there a better option that doesn't require increasing the memory? The details are as follows:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.base/java.lang.StringUTF16.compress(StringUTF16.java:160)
	at java.base/java.lang.String.<init>(String.java:3214)
	at java.base/java.lang.String.<init>(String.java:276)
	at java.base/java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:602)
	at java.base/java.nio.CharBuffer.toString(CharBuffer.java:1403)
	at java.base/java.util.regex.Matcher.toMatchResult(Matcher.java:274)
	at java.base/java.util.Scanner.match(Scanner.java:1399)
	at java.base/java.util.Scanner.nextLine(Scanner.java:1652)
	at VariantInput.getSingleList(VariantInput.java:72)
	at VariantInput.readAllFiles(VariantInput.java:38)
	at Main.runJasmine(Main.java:61)
	at Main.main(Main.java:22)

mkirsche commented 3 years ago

Hi, it seems highly unlikely that it would use that much memory, so I'm guessing it's something with the way Java is set up on your machine. Could you please try adding -Xmx40g to the Java invocation in the Jasmine executable? And if you watch the process (e.g. in the output of "top"), does it appear to be using 500 GB, or is it crashing at some other limit?
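For illustration, here is a sketch of that edit. It assumes the Jasmine wrapper script's java call begins with "java -cp" (the real script may differ), and it operates on a stand-in file named jasmine_demo so nothing real is overwritten:

```shell
# Create a stand-in wrapper script resembling a java launcher line.
printf 'java -cp %s Main "$@"\n' 'Jasmine.jar' > jasmine_demo
# Insert the heap cap right after "java " (a .bak backup is kept).
sed -i.bak 's/^java /java -Xmx40g /' jasmine_demo
cat jasmine_demo   # line now begins with: java -Xmx40g -cp ...
```

The same sed edit, pointed at the actual Jasmine executable, would add the flag in place.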

samll-rookie commented 3 years ago

Okay, thanks. I also suspect the Java default heap is too small; I am currently testing with java -Xmx40g. I saw the error in the log, but did not see memory usage reach 500G during the analysis.

mkirsche commented 3 years ago

Thanks a lot for checking that! If it works on your test, I'll plan to make that a configurable parameter in the Jasmine script for the next release so that it won't be necessary to manually edit the executable like that.

Best, Melanie



samll-rookie commented 3 years ago

I added -Xmx200g to the java invocation (java -Xmx200g -cp), but the memory still overflows, and the error is different from the previous memory error. I am now re-testing with -Xmx300g. Is that plan reasonable? The error is as follows:

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00002b0d9e000000, 34527510528, 0) failed; error='Not enough space' (errno=12)
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (mmap) failed to map 34527510528 bytes for committing reserved memory.
An error report file with more information is saved as:
/export2/master2/renpp/project/jiacan_genome/06_submit_197_87_117_155/05.SURVIVOR_merge/hs_err_pid26005.log

mkirsche commented 3 years ago

Hi,

Thanks for looking into that! Did you also get that error with the 40G run? I'd be worried that with 300G, especially with multiple threads, you'd risk allocating more memory than is available to you. Could you please also give me more details on the size of the dataset you are working with? How many VCFs are listed in head5.vcf.list, and about how many variants does each of them contain?

mkirsche commented 3 years ago

Could you please also let me know the output of running this on your machine (to see the Java defaults)?

java -XX:+PrintFlagsFinal -version | grep -iE 'HeapSize|PermSize|ThreadStackSize'

Thank you!

samll-rookie commented 3 years ago

Question 1: Yes, I ran with 40G of memory and got the error. I then set 300G in the same place, but the memory still overflowed during the run. I have 500 samples, and each sample has around 150,000 SVs.

Question 2:

java -XX:+PrintFlagsFinal -version | grep -iE 'HeapSize|PermSize|ThreadStackSize'
   intx CompilerThreadStackSize     = 1024        {pd product} {default}
 size_t ErgoHeapSizeLimit           = 0           {product} {default}
 size_t HeapSizePerGCThread         = 43620760    {product} {default}
 size_t InitialHeapSize             = 2147483648  {product} {ergonomic}
 size_t LargePageHeapSizeThreshold  = 134217728   {product} {default}
 size_t MaxHeapSize                 = 32178700288 {product} {ergonomic}
  uintx NonNMethodCodeHeapSize      = 8178940     {pd product} {ergonomic}
  uintx NonProfiledCodeHeapSize     = 121739650   {pd product} {ergonomic}
  uintx ProfiledCodeHeapSize        = 121739650   {pd product} {ergonomic}
   intx ThreadStackSize             = 1024        {pd product} {default}
   intx VMThreadStackSize           = 1024        {pd product} {default}
openjdk version "11.0.1" 2018-10-16 LTS
OpenJDK Runtime Environment Zulu11.2+3 (build 11.0.1+13-LTS)
OpenJDK 64-Bit Server VM Zulu11.2+3 (build 11.0.1+13-LTS, mixed mode)

Question 3: I am currently testing with -Xmx500g; the top command shows that 95% of the memory is in use. I suspect the memory may still be insufficient.

mkirsche commented 3 years ago

Hi,

Thank you for the fast reply! From your Xmx500g test it does sound like it is using all of the memory on your machine. Your dataset is quite large, but I'm not sure how it's using that much memory, so it may be an error in the code somewhere. In our experience the memory used is about 1.5 GB per million variants. Is the Jasmine command for these latest experiments the same one you were running in your first message? And assuming your 500g run ran out of memory, did it crash in the same place (based on the stack trace) as in your earlier experiments?
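For scale, the ~1.5 GB-per-million-variants figure above, applied to the dataset size reported in this thread (500 samples with roughly 150,000 SVs each), suggests the merge should fit well inside 500 GB, which is why running out of memory is surprising. A quick sanity check:

```python
# Back-of-envelope estimate using the ~1.5 GB per million variants figure
# from this thread and the reported dataset size.
samples = 500
svs_per_sample = 150_000
total_variants = samples * svs_per_sample        # 75,000,000 variants
expected_gb = total_variants / 1_000_000 * 1.5   # ~112.5 GB expected
print(total_variants, expected_gb)
```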

Another thing you could try is to run Jasmine on a subset of the samples (maybe 10 to start with?) to see if it works and/or how much memory it uses. Depending on what happens, you have a few options to reduce the memory at a slight cost to runtime:

1) Divide your data into subsets (perhaps of 10 to 15 samples each, depending on what fits in memory). Merge each subset, then merge the resulting VCFs from all of the subsets into a final VCF. The SUPP_VEC_EXT field will be able to capture sample presence across multiple merges, and by merging in groups the memory should be reduced substantially.

2) Break up the variants by chromosome for each sample, merge within each chromosome, and combine the data back at the end. This would require that no single chromosome has too much data for the amount of memory available.
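As a starting point for the per-chromosome option, the first step (splitting each sample's VCF by chromosome) can be sketched roughly like this; it assumes a standard tab-delimited VCF, and the function name and file layout are illustrative, not part of Jasmine:

```python
import os

def split_vcf_by_chrom(vcf_path, out_dir):
    """Write one VCF per chromosome, copying the header lines to each file."""
    os.makedirs(out_dir, exist_ok=True)
    header, handles = [], {}
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith('#'):          # header lines precede records
                header.append(line)
                continue
            chrom = line.split('\t', 1)[0]    # CHROM is the first column
            if chrom not in handles:
                handles[chrom] = open(os.path.join(out_dir, chrom + '.vcf'), 'w')
                handles[chrom].writelines(header)
            handles[chrom].write(line)
    for handle in handles.values():
        handle.close()
    return sorted(handles)                    # chromosomes encountered
```

Each per-chromosome group of VCFs could then be merged with a separate Jasmine run, and the outputs concatenated at the end.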

In the meantime, if it would be possible for you to share any of the data (either a representative sample's VCF or even just a few lines from one of them), I would be happy to help look into what might be causing the memory footprint to be so large. One thing that comes to mind is if the variant IDs are especially long, but they'd have to be thousands of characters for it to use up so much memory while just reading in the input.

Thanks so much for your patience and interest in using Jasmine!

Melanie