njdbickhart / SheepHiFiManuscript

A collection of scripts and workflows to analyze HiFi assembled metagenomes
9 stars 2 forks source link

Issues with Missing Scripts and ”mag_phaser.py“ Execution in the Snakemake Workflow #1

Open wyanren opened 1 year ago

wyanren commented 1 year ago

Dear Dr. Bickhart,

Hope this message finds you well. I have been working with the Snakemake workflow (magphase_workflow/MAGPhase) and encountered a couple of issues that I'd like to bring to your attention:

1. Missing Scripts in the git

I noticed that the scripts filter_bam_by_coverage.py and mag_phaser.py referenced in the Snakemake workflow are missing from this repository (found in other repository: MagPhase and cDNA_Cupcake)

2. Issue with mag_phaser.py

When running the mag_phaser.py process, I encounter a bug. I suspect the issue might be related to the bed file that I'm providing. The terminal output is as follows:

(magphase) $ python /home/wangyanren/software/MagPhase/phasing/mag_phaser.py --bhFDR 0.01 -a mags/bin.fasta -b mapping/assembly.filtered.bam -g beds/bin.bed -o phase2
--bhFDR 0.01 is given! Will be using Benjamini–Hochberg correction insteaad. --pval_cutoff is ignored.
[E::mpileup] fail to parse region 'bin.1_contig_71_85:84043-84417' with mapping/assembly.filtered.bam
Traceback (most recent call last):
  File "/home/wangyanren/software/MagPhase/phasing/mag_phaser.py", line 146, in <module>
    main(args, parser)
  File "/home/wangyanren/software/MagPhase/phasing/mag_phaser.py", line 81, in main
    for mpileupFile, contig, start, end in elitePileups(args.bamfile, args.genes, args.assembly, args.output):
  File "/home/wangyanren/software/MagPhase/phasing/mag_phaser.py", line 138, in elitePileups
    if subprocess.check_call(cmd, shell=True)!=0:
  File "/home/wangyanren/conda/envs/magphase/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'samtools mpileup -r bin.1_contig_71_85:84043-84417 -f mags/bin.fasta -s mapping/assembly.filtered.bam > phase2.bin.1_contig_71_85_84043_84417.pileup' returned non-zero exit status 1.
I have converted the CheckM results, which contain loci information for single-copy genes, into a bed file as follow: Contig ID Start End Gene Name
bin.1_contig_71_85 84043 84417 Ribosom_S12_S23
bin.1_contig_71_86 84514 84984 Ribosomal_S7
bin.1_contig_19_131 133946 136519 Ribosomal_S7
bin.1_contig_71_96 92125 92946 Ribosomal_L2
bin.1_contig_71_96 92125 92946 Ribosomal_L2
bin.1_contig_71_99 93606 94307 Ribosomal_S3_C
bin.1_contig_19_656 681281 681742 Ribosomal_S3_C
bin.1_contig_19_656 681281 681742 Ribosomal_S3_C
bin.1_contig_71_97 92963 93241 Ribosomal_S19
bin.1_contig_71_98 93256 93588 Ribosomal_L22
bin.1_contig_71_103 95340 95711 Ribosomal_L14
bin.1_contig_71_100 94320 94730 Ribosomal_L16
bin.1_contig_71_95 91805 92107 Ribosomal_L23
wyanren commented 1 year ago
I apologize for any inconvenience my previous communication may have caused. I have come to realize that the appropriate bed file format should include the following columns: Contig_ID, start position, end position, and Contig_gene_ID, as follows. After correcting this file, map_phasers.py could run now. Contig_ID start end Contig_gene_ID
bin.1_contig_2 213977 214351 bin.1_contig_2_236
bin.1_contig_2 213410 213880 bin.1_contig_2_235
bin.1_contig_2 205447 206268 bin.1_contig_2_225
bin.1_contig_2 205447 206268 bin.1_contig_2_225
bin.1_contig_2 204086 204787 bin.1_contig_2_222
bin.1_contig_61 158076 158537 bin.1_contig_61_179

Furthermore, I am wondering the recommended process (or working process in the manuscript) for generating the bed file specifically for single-copy genes. If you could kindly provide insights into this matter, I would greatly appreciate it.

Thank you for your time and consideration.

Best regards, Yanren