kehrlab / PopIns2

Population-scale detection of non-reference sequence variants using colored de Bruijn Graphs
GNU General Public License v2.0

An error occurred using Popins2 merge #39

Closed. abcyulongwang closed this issue 1 year ago

abcyulongwang commented 1 year ago

Dear developer!

I used 1085 resequenced individuals to extract non-reference sequences with Popins2 assemble, but when I run Popins2 merge it fails with "Segmentation fault (core dumped) popins2 merge -t 30 -y pan_graph.gfa -z pan_graph.bfg_colors". The same error occurs when I use the -r parameter or build BFG_COLORS. Maybe the program requires too much memory. Is there any way to run it without reducing the number of resequenced samples? I hope to get your reply.

Sincerely yulong

Krannich479 commented 1 year ago

Dear Yulong, thanks for using PopIns2! I am happy to assist you in finding out what is breaking your program execution.

First, the Segmentation fault (core dumped) error very likely does not originate from a lack of main memory. In my experience, this error typically points to a problem with a compilation unit of PopIns2. Can you please verify whether the following conditions were satisfied during the installation process:

  1. For the time being, please use a Bifrost version prior to April 22nd, 2022, e.g. git clone this commit of Bifrost. This is because the color encoding changed the day after and I have not yet verified compatibility with PopIns2.

  2. Did you manually and correctly compile Bifrost according to the PopIns2 specifications? That is, it is important to set MAX_KMER_SIZE=64 when compiling Bifrost and to use a maximum of k=63 for PopIns2. Please do not install Bifrost via Conda; it will break PopIns2 at run time.

  3. [Only if you work on a cluster] Because you used -t 30 I assume you work on a high-performance computing cluster. If that's the case, please ensure that popins2 merge runs on a compute node with the same CPU architecture as the compute node that compiled PopIns2. If you cannot guarantee this, please remove the --native flag from the compilation of Bifrost (see README) and PopIns2 (remove it from the Makefile). However, the latter is strongly discouraged as it slows down PopIns2 by orders of magnitude.

  4. Please verify that you used the input parameters of popins2 merge correctly. Using -y and -z is very unusual and is only required if you built a CCDBG via Bifrost outside of PopIns2. In that case you must also adjust the graph's parameters, e.g. k, accordingly. A much easier and more common way to run popins2 merge is to use its input parameters -r or -s. Using -r with the directory containing all the samples is usually sufficient. Please see the Popins2 merge usage description or find an example here.
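Items 2 and 3 above can be checked before rerunning. The following is a minimal Python sketch and not part of PopIns2: it assumes Linux nodes where /proc/cpuinfo lists CPU features on a `flags` line (`Features` on ARM), and all function names are hypothetical. If the build node reports flags that the run node lacks, a binary compiled with native optimisations may use instructions the run node does not support.

```python
def cpu_flags(cpuinfo_text):
    """Return the CPU feature flags found in the text of /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in ("flags", "Features"):
            return set(value.split())
    return set()


def missing_on_run_node(build_cpuinfo, run_cpuinfo):
    """Flags the build node has but the run node lacks."""
    return cpu_flags(build_cpuinfo) - cpu_flags(run_cpuinfo)


def k_is_supported(k, max_kmer_size=64):
    """Item 2: with MAX_KMER_SIZE=64 the largest usable k is 63."""
    return 0 < k < max_kmer_size


if __name__ == "__main__":
    # Made-up cpuinfo excerpts for illustration only.
    build = "processor : 0\nflags : fpu sse2 avx2 avx512f\n"
    run = "processor : 0\nflags : fpu sse2 avx2\n"
    print(sorted(missing_on_run_node(build, run)))  # ['avx512f']
    print(k_is_supported(63), k_is_supported(64))   # True False
```

On a real cluster, the two input strings would come from reading /proc/cpuinfo on the node that compiled PopIns2 and on the node that runs popins2 merge.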

Finally, I am glad you chose PopIns2 for this task! Your scenario of a very large number of samples is precisely what PopIns2 was built for. At this point I don't think you need to reduce the number of samples. In our original publication of PopIns2 we applied the software to 1000 human samples, requiring <3GB of RAM. Main memory is typically not a limiting factor.

abcyulongwang commented 1 year ago

Dear Thomas

Thank you for your prompt and effective reply!

Based on your suggestion, I compiled Bifrost with MAX_KMER_SIZE=64, and "Popins2 merge" finally ran successfully. I obtained 500 Mb of non-reference sequence, from which microbial sequences have not yet been removed. Popins2 works very well with second-generation sequencing data for extracting and placing non-reference sequences, and it gives excellent results on large-scale data. I also have 20 PacBio and Nanopore datasets. I want to use samtools to extract the non-reference sequences from the BAM files and add them to "supercontigs.fa". Because these third-generation non-reference sequences are longer, they may improve the quality of the supercontigs. Will doing this affect the subsequent sequence placement? If this approach is appropriate, what software do you recommend for it?

I'm very much looking forward to your reply and wish you happiness! Sincerely, Yulong

Krannich479 commented 1 year ago

Dear Yulong, I am glad everything worked out in the end and you like Popins2!

Regarding your research question: In theory, you describe a reasonable way of using Popins2 for long reads. If I am not overlooking some technical aspect of the downstream modules (placing, genotyping), you should be able to just add sequences to the supercontigs.fa. In fact, I'd be curious to hear if that works; I have never tried this myself. However, I am doubtful that you will see a major benefit from placing and genotyping non-reference sequences (NRS) with Popins2 once you have discovered them from long reads. If you detect a set of NRS from reference-aligned long reads, they are typically very accurate in terms of (1) high sequence quality and (2) precise insertion breakpoints. Popins2 certainly cannot compete with (1), but potentially with (2), depending on the clustering of long reads during the detection. You might get some refined insertion breakpoints. Also, your idea adds nicely to your existing workflow.

That being said, there is excellent open-source software available for detecting and genotyping NRS from long reads. For PacBio data I can highly recommend SVDSS >v1.0.5. Its algorithm is perfectly suited for NRS detection. For ONT data you can also try SVIM, Sniffles2 or cuteSV. I know that at least the first two can jointly process multiple individuals.
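For the idea of adding long-read sequences to supercontigs.fa, a minimal Python sketch could look like the following. It is not part of PopIns2 and rests on assumptions: the extra sequences are already in FASTA format, the output is written to a new file rather than over the input, and the `longread_` prefix and function names are hypothetical. Whether the downstream placing and genotyping modules accept the result is untested, as noted above.

```python
def _seq_id(header):
    """Return the first whitespace-delimited token of a FASTA header."""
    parts = header.split()
    if not parts:
        raise ValueError("FASTA record without a name")
    return parts[0]


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def append_contigs(supercontigs, extra, out, prefix="longread_"):
    """Copy `supercontigs` to `out`, then append the records from `extra`.

    Extra records are renamed with `prefix`; a name collision raises
    ValueError. `out` must be a different file from both inputs.
    Returns the number of records appended.
    """
    seen = set()
    added = 0
    with open(out, "w") as dst:
        for name, seq in read_fasta(supercontigs):
            seen.add(_seq_id(name))
            dst.write(f">{name}\n{seq}\n")
        for name, seq in read_fasta(extra):
            new_name = prefix + _seq_id(name)
            if new_name in seen:
                raise ValueError(f"duplicate contig name: {new_name}")
            seen.add(new_name)
            dst.write(f">{new_name}\n{seq}\n")
            added += 1
    return added
```

Prefixing the added records keeps them distinguishable from the short-read supercontigs, which makes it easier to compare how the two groups behave in placement.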

abcyulongwang commented 1 year ago

Dear Thomas, thank you for your professional advice. I will test the software you recommend! I wish you a happy life.