metaGmetapop / metapop

A pipeline for the macro- and micro-diversity analyses and visualization of metagenomic-derived populations
MIT License
37 stars 10 forks source link

06.SNP_counts step running for an abnormally long time #13

Open ChaoLab opened 2 years ago

ChaoLab commented 2 years ago

I have been running the metapop on the test dataset (https://datacommons.cyverse.org/browse/iplant/home/shared/iVirus/MetaPop_Testdata) for several days. While one step - the mpileup step - takes a very long time (69 hrs as you can see here). Is this the problem/bug that someone encountered before? MicrosoftTeams-image I think this step is for 06.SNP_counts MicrosoftTeams-image (1) These three files are just empty

metaGmetapop commented 2 years ago

Are you running it locally or on a server? That is very strange for it to be running for that long.

ChaoLab commented 2 years ago

I am running it on a server. I also tested one samtools run by myself, using samtools mpileup --skip-indels -f ./01.Genomes_and_Genes/all_genomes.fasta ./02.Filtered_Samples/3300042566.viral_species_rep.id90_preprocessed.bam -o samtools.mpileup.output

It just took several minutes with the following results: Screenshot 2022-03-14 204358

Is it due to some minor issues in the script related to producing the right outputs?

Update 2020-03-15

I checked the "top" screen when I am running metapop for the test dataset: Screenshot 2022-03-15 105852 It seems that the output option (-o) for each command was lost? (not quite sure on this) Is this the problem?

metaGmetapop commented 2 years ago

We're looking into this. Could you send us (via emails on homepage of the github) your log file and error file from your run?

KGerhardt commented 2 years ago

Hello.

I apologize that I couldn't get on this sooner. I'm working on this issue.

The python script for the SNP calling section reads the command line output from mpileup (default behavior without -o) and so skips the direct production of an external file with mpileup. Lines should be processed, then printed out to metapop's own, smaller pileup file that summarizes the parts of the pileup that metapop needs.

I have a pretty good guess at what's happening at the mpileup step to cause the empty filtered_SNP files. Data from the VCF calls is passed to parallel processes using a global scope variable. Python's multiprocessing is supposed to pick up any variables in global scope and replicate them to worker processes for parallelization. On my system and on several servers we've run metapop on, it does. It looks like it isn't on yours.

This means that the part of my code looking over the mpileup output might be caught in a limbo where it thinks it has a reference list of SNP sites that it does not, causing the code to hang at that location, both without erroring or allowing the mpileup process to die upon completion of the input file.

I ran into a similar problem in another project I was working on - a collaborator couldn't run the same code I could because of this exact issue. It appears that this is a bug in python's multiprocessing module, as it's system-dependent behavior that the module doesn't handle correctly. It can be fixed by explicitly passing data to the workers instead of relying on the global variable. I'm going to go implement that fix everywhere I see this method of passing data and upload a new version.

ChaoLab commented 2 years ago

Many thanks for your comprehensive explanation on this python multiprocessing bug!

KGerhardt commented 2 years ago

There is a new version of metapop (0.0.54) up on pip that should be available for install.

It incorporates explicit data passing to (hopefully) resolve the issue of indefinite mpileup runtime. I'm kind of surprised that the parallel process did not crash upon being unable to access data, but if my guess about what was causing the issue is correct, it's hard to predict exactly what can happen from system to system. I would appreciate you giving it another try with the update and telling me how it goes.

There was also a bug in locating genes that were supplied manually and were outside of the folder in which metapop normally predicts them that I fixed as well - this was behind the "file not found" issue. It should also be fixed.

ChaoLab commented 2 years ago

Hi, It seems to be the same problem. Metapop was halted in the same samtools mpileup step. I have deleted my old conda, and follow your instructions here (https://github.com/metaGmetapop/metapop#installation) to set up a new conda env and use the new metapop (0.0.54). Screenshot 2022-03-19 085632

KGerhardt commented 2 years ago

Okay. I suppose the next thing to try is for me to make a script that can give me some extended error logs and pass it to you. I'll get on that.

ChaoLab commented 2 years ago

Hi, When previously I was looking the long-halting problems, I found this:

https://github.com/metaGmetapop/metapop/blob/master/metapop/metapop_snp_call.py In line 161, you have defined the out file. While in line 189, did you forget to input the out file into the square brackets?

It looks like you are using "subprocess.Popen" and "stdout=subprocess.PIPE" on line 191 to push the output into the variable "pile_stream". Then in line 219 they are putting "pile_stream" into "out" (file opened on line 186). This looks like a not a so good way of actually coding it in terms of how Python works - quoted from one of my labmates who is familiar with Python.

It is suggested that can it be written in a more easy-going way to avoid potential multiprocessing problems in Python? Hope this info from me and my labmate can help

KGerhardt commented 2 years ago

Line 161 is simply the name of the output file I intend to create. I did not forget to put the output file into the mpileup command - excluding this argument causes mpileup to print to stdout instead, which I'm capturing using the subprocess.PIPE.

I agree with your labmate that, ordinarily, this wouldn't make too much sense. However, in this case it was very much intentional. You can see the loop where printing occurs starting at line 193. That loop includes conditionals that prevent the printing of most of the output from the mpileup call, excluding all positions not already called as a SNP site by the earlier VCFtools call.

I could write the mpileup output to a file using mpileup's file output, then read it into python, do the same filtering, and then write the same results. I felt that it was better to skip this additional I/O step and perform the filtering directly on the captured output.

I'm still not sure why the call is failing for you, though. I've tried running the toy dataset on several platforms, and it's working for me.

In the 05.Variant_Calls subdirectory, do the "...final_called.txt" files contain data?

ChaoLab commented 2 years ago

Yes, 05.Variant_Calls subdir contained right data: Screenshot 2022-03-23 093010

I followed this to make the metapop conda env: https://github.com/metaGmetapop/metapop#installation The Python version in my conda is v3.7.12.

While one of my labmate (another one) made her own metapop conda env, I used her env and could make it through for 06.SNP_counts step. But it ends up with another error report (see the attached file): 02.run.metapop.v2.log

Her metapop conda yml file (remove the ".txt" to use it properly): metapop.yml.txt

KGerhardt commented 2 years ago

Okay, well the VCF data being present is good. That makes a more direct test of the mpileup errors simpler.

Ah. To your coworker's problem, it seems I forgot to update the github with current code. I had just been pushing new versions to pypi and it slipped my mind. That error was a result of fixes we made for another user with respect to a problem coming from an unusual read naming convention.

A pip reinstall of metapop should fix your coworker's issue. I just hope that it doesn't introduce the problem that you're running into.

On Wed, Mar 23, 2022 at 10:40 AM Zhichao Zhou @.***> wrote:

Yes, 05.Variant_Calls subdir contained right data: [image: Screenshot 2022-03-23 093010] https://user-images.githubusercontent.com/35715202/159723212-1d1a37e1-f397-4aa8-9107-c576ca7e8e50.jpg

I followed this to make the metapop conda env: https://github.com/metaGmetapop/metapop#installation The Python version in my conda is v3.7.12.

While one of my labmate (another one) made her own metapop conda env, I used her env and could make it through for 06.SNP_counts step. But it ends up with another error report (see the attached file): 02.run.metapop.v2.log https://github.com/metaGmetapop/metapop/files/8333918/02.run.metapop.v2.log

Her metapop conda yml file (remove the ".txt" to use it properly): metapop.yml.txt https://github.com/metaGmetapop/metapop/files/8333933/metapop.yml.txt

— Reply to this email directly, view it on GitHub https://github.com/metaGmetapop/metapop/issues/13#issuecomment-1076455278, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGZXG3OWW2YX6GWVASZPYTTVBMUPLANCNFSM5QUHBZUA . You are receiving this because you commented.Message ID: @.***>

ChaoLab commented 2 years ago

Hi, I reinstalled metapop in conda env and now the metapop is v.0.054 (the previous one was v0.0.42). While, another error report just pop up: run_metapop.log

"linked_snp_results.tsv" in 09.Linked_SNPs only contained header but not any contents. 10.Microdiversity is empty.

liaohu1231 commented 2 years ago

I have meet same error with ChaoLab. The mpileup have runed very long time

KGerhardt commented 2 years ago

I'll have to solve the linked SNPs error issue separately.

Since the VCF call seems to be successfully generating output successfully by issuing output through the command call, let's try that with the mpileup files. I've put up a new version (0.56) that uses the mpileup output option to write the full mpileup output, then cleans those outputs after the mpileup call wraps up.

It also no longer discards the stderr outputs from mpileup, so maybe we'll see if there's a different issue that was hidden along the way.

Can people please also check their versions of samtools? This shouldn't be an issue, but I want to make sure. samtools --version should show it.

liaohu1231 commented 2 years ago

image

KGerhardt commented 2 years ago

Thank you.

On Fri, Mar 25, 2022 at 1:40 PM liaohu1231 @.***> wrote:

[image: image] https://user-images.githubusercontent.com/59506038/160173317-f7fc0cb1-4e78-4c4b-8926-bb1ef1f20dd8.png

— Reply to this email directly, view it on GitHub https://github.com/metaGmetapop/metapop/issues/13#issuecomment-1079255804, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGZXG3OOLLFE5BNX3N2WOILVBX3ALANCNFSM5QUHBZUA . You are receiving this because you commented.Message ID: @.***>

KGerhardt commented 2 years ago

"linked_snp_results.tsv" in 09.Linked_SNPs only contained header but not any contents.

The linked SNP step uses a similar output capture method on samtools calls to the component that seems to be causing issues with SNP calling. If the change to output files works in the SNP calling step, I'll repeat it with the linked SNPs step.

ChaoLab commented 2 years ago

I'll have to solve the linked SNPs error issue separately.

Since the VCF call seems to be successfully generating output successfully by issuing output through the command call, let's try that with the mpileup files. I've put up a new version (0.56) that uses the mpileup output option to write the full mpileup output, then cleans those outputs after the mpileup call wraps up.

It also no longer discards the stderr outputs from mpileup, so maybe we'll see if there's a different issue that was hidden along the way.

Can people please also check their versions of samtools? This shouldn't be an issue, but I want to make sure. samtools --version should show it.

I encountered the same Samtools long-time halting issue again when using v0.0.56 metapop: Uninstall_old_install_new_metapop My Samtools version is: Screenshot 2022-03-25 173926

It seems to be still not working

KGerhardt commented 2 years ago

Interesting. I changed my samtools version to 1.15 to check if that might be the problem, and now it's replicating the error for me. However, I'm not even able to call the pileup directly on the command line, now. It's just hanging while using CPU resources, with no output.

I'm going to try some testing using these versions of samtools, htslib and bcftools and seeing if there's a fix. It may take me a little bit, as I'm juggling other projects.

ChaoLab commented 2 years ago

Can I know the current Samtools version on your server? I might take a try with the old version. Just a ask

KGerhardt commented 2 years ago

Our server has 1.10 and I had been running 1.7.1. Apparently the more recent versions of samtools had fully deprecated samtools mpileup. It's curious that the call works for you at all when you issue it on the command line.

And of course, it seems that the prior output format is no longer available through any transformation of outputs with the current version of samtools. I'll have to significantly rework this and probably a few other sections to work with current samtools.

ChaoLab commented 2 years ago

Hi, Here is another run of mine using the old Samtools (with MetaPop being v0.0.56) in conda env (Samtools v1.10): Samtools_version

I can make it through to 09.Linked_SNPs, but only with the header in the result file. 10.Microdiversity folder is empty. 09 dir

Here is the log report (Error: "UnboundLocalError: local variable 'new' referenced before assignment"): run_metapop.log

It seem that I can make it through for the Samtools long-halting problems by using the old version as you did. So I am quite sure that the Samtools long-halting problem is due to the version of Samtools we used (me and the other reporter here).

KGerhardt commented 2 years ago

Yes, it appears that a switch to just use the information in BCFtools' version of mpileup is warranted so that it's more version-safe.

I think that the linked SNPs issue will also need resolving by allowing the subprocess call to make an output file and then reading that. I hadn't seen any problems in testing with the capture of subprocess output, but it seems that it's more robust if metapop doesn't attempt to do so. I checked the samtools call in mine reads, and it still works exactly as expected in samtools 1.15, so I don't think that samtools is the problem, there.

ChaoLab commented 2 years ago

So If I understand correctly, the warranted changes will be: 1) use mpileup in BCFtools 2) make subprocess call to make an output file and read that, in replace of the previous method. Right? Looking forward to your new release of metapop!

KGerhardt commented 2 years ago

Yes, that summary is correct.

KGerhardt commented 2 years ago

This is proving to be an interesting problem. I've found what I think is a bug in BCFtool's reporting of SNP sites. It doesn't have the ability to properly record more than one separate allele variant at any given site. Per-site info is combined into single lines by default, and when split, the information from one line is mostly copied to the next, resulting in impossible outcomes.

Here's an example of a line called with bcftools call module and then split with bcftool's norm module to list one variant per line (I'm only including the relevant parts, some info is deleted):

Kang_0019_ST001_A_NODE_13_length_59981_cov_340.027517 46187 . A G DP=249;DP4=3,1,135,60

Kang_0019_ST001_A_NODE_13_length_59981_cov_340.027517 46187 . A C DP=249;DP4=3,1,135,60

The same genome in the same locus (46187) with reference base A and alternates G and C. The DP= flag indicates depth of all reads covering the position and the DP4 shows high-quality bases mapping to ref-forward strand, ref-reverse, alt-forward, alt-reverse, respectively. The count of HQ bases must be <= total bases.

Both agree that there are 4 reference bases, which makes sense. They also imply that there are 195 G bases and 195 C bases, which is obviously impossible with the total of 249 bases at the position. I interpret this as an error from BCFtools, and the lack of the original samtools mpileup means that I'll need to develop an alternative strategy for accurately counting distinct alternate alleles at multi-allelic sites.

ChaoLab commented 2 years ago

I searched and found this: https://github.com/alimanfoo/pysamstats Maybe pysam package can work?

KGerhardt commented 2 years ago

I think I have a version working now - 0.0.60.

There is still some retained behavior from samtools leftover as of version 1.15 - the pileup function is operating for me using that version of samtools with an altered command and subprocess method in python.

Please give it a try. Thank you again for your patience and helpfulness.

ChaoLab commented 2 years ago

Hi @KGerhardt Many thanks for your efforts in addressing this issue. Here I got this error report (I used MetaPop v0.0.60 and Samtools v1.15 within conda):

Linking SNPs starting at: 05/04/2022 16:58:26...Traceback (most recent call last):
  File "/home/zhichao/miniconda3/envs/metapop/bin/metapop", line 8, in <module>
    sys.exit(main())
  File "/home/zhichao/miniconda3/envs/metapop/lib/python3.7/site-packages/metapop/metapop_main.py", line 329, in main
    linked_file = metapop.metapop_mine_reads.do_mine_reads(output_directory_base, threads)
  File "/home/zhichao/miniconda3/envs/metapop/lib/python3.7/site-packages/metapop/metapop_mine_reads.py", line 477, in do_mine_reads
    formatted_to_valid_combos(formatted_snps, res, output_file)
  File "/home/zhichao/miniconda3/envs/metapop/lib/python3.7/site-packages/metapop/metapop_mine_reads.py", line 373, in formatted_to_valid_combos
    new_keys.append(new)
UnboundLocalError: local variable 'new' referenced before assignment

Attached is the log file: run_metapop.log

"09.Linked_SNPs" dir only contains "linked_snp_results.tsv" with the header line. "10.Microdiversity" dir is empty. This is for the test data run.

ChaoLab commented 2 years ago

Hi @KGerhardt Recently, I tried MetaPop on running my data. I got a different error message as this:

Traceback (most recent call last):
  File "/home/zhichao/miniconda3/envs/metapop/bin/metapop", line 8, in <module>
    sys.exit(main())
  File "/home/zhichao/miniconda3/envs/metapop/lib/python3.7/site-packages/metapop/metapop_main.py", line 313, in main
    joined_fastas, reference_genes = metapop.metapop_snp_call.call_variant_positions(output_directory_base, joined_fastas, min_obs, min_q, min_pct, threads, ref_samp, reference_genes)
  File "/home/zhichao/miniconda3/envs/metapop/lib/python3.7/site-packages/metapop/metapop_snp_call.py", line 159, in call_variant_positions
    new_genomes, new_genes = associate_genes(directory, consensus, snp_inputs, min_obs, min_pct, ref_genes)
  File "/home/zhichao/miniconda3/envs/metapop/lib/python3.7/site-packages/metapop/metapop_snp_call.py", line 746, in associate_genes
    seq[pos_in_gene] = base_replace
IndexError: list assignment index out of range

This time, all the first eight directories were generated, while "09.Linked_SNPs" and "10.Microdiversity" directories were not provided (directories not existing in "MetaPop" resulting folder)

yan1365 commented 2 years ago

Hi @KGerhardt , I got a similar issue. The 09.Linked_SNPs and 10.Microdiversity dirs are empty. I checked the log file and it appears to be associated with samtools.

MetaPop SNP refinement finished at: 18/04/2022 00:42:55 Linking SNPs starting at: 18/04/2022 00:43:00...samtools view: unrecognised option '--output'

The version of Samtools is 1.15.1

Thanks

yan1365 commented 2 years ago

Hi @KGerhardt , I got a similar issue. The 09.Linked_SNPs and 10.Microdiversity dirs are empty. I checked the log file and it appears to be associated with samtools.

MetaPop SNP refinement finished at: 18/04/2022 00:42:55 Linking SNPs starting at: 18/04/2022 00:43:00...samtools view: unrecognised option '--output'

The version of Samtools is 1.15.1

Thanks

Hi, I previously ran the program with the "--no_macro" flag. I removed the flag and ran the whole pipeline again and expected results were produced.

ChaoLab commented 2 years ago

Hi @KGerhardt Recently, I tried MetaPop on running my data. I got a different error message as this:

Traceback (most recent call last):
  File "/home/zhichao/miniconda3/envs/metapop/bin/metapop", line 8, in <module>
    sys.exit(main())
  File "/home/zhichao/miniconda3/envs/metapop/lib/python3.7/site-packages/metapop/metapop_main.py", line 313, in main
    joined_fastas, reference_genes = metapop.metapop_snp_call.call_variant_positions(output_directory_base, joined_fastas, min_obs, min_q, min_pct, threads, ref_samp, reference_genes)
  File "/home/zhichao/miniconda3/envs/metapop/lib/python3.7/site-packages/metapop/metapop_snp_call.py", line 159, in call_variant_positions
    new_genomes, new_genes = associate_genes(directory, consensus, snp_inputs, min_obs, min_pct, ref_genes)
  File "/home/zhichao/miniconda3/envs/metapop/lib/python3.7/site-packages/metapop/metapop_snp_call.py", line 746, in associate_genes
    seq[pos_in_gene] = base_replace
IndexError: list assignment index out of range

This time, all the first eight directories were generated, while "09.Linked_SNPs" and "10.Microdiversity" directories were not provided (directories not existing in "MetaPop" resulting folder)

This is probably due to that I provide a custom gene file instead of using a Prodigal-generated gene file. The custom gene file should be strictly in the same format/shape as a Prodigal-generated one. The gene ID should be re-ordered (naturally ordered) if you want to use a custom gene file. For instance, 11 is in front of 2 , 3, and 4, which needs to be re-ordered. Meanwhile, the gene ID should start from 1, and the start and stop points of each gene in a scaffold should start from the very beginning of the scaffold (but not the intermediate place of a scaffold).