raw-lab / MetaCerberus

Python code for versatile Functional Ontology Assignments for Metagenomes searching via Hidden Markov Model (HMM) with environmental focus of shotgun metaomics data
BSD 3-Clause "New" or "Revised" License
48 stars 7 forks source link

Step 08 and 09 dropping samples? #13

Closed ebueren closed 5 months ago

ebueren commented 6 months ago

LSB Version: :core-4.1-amd64:core-4.1-noarch Distributor ID: RedHatEnterpriseServer Description: Red Hat Enterprise Linux Server release 7.9 (Maipo) Release: 7.9 Codename: Maipo

Hello! I thought it might be best to open a new ticket since I'm not sure this is related to the same issue as reformatting fastqs as before: (https://github.com/raw-lab/MetaCerberus/issues/5 ).

Something seems to be going awry in the hmmer and parsing steps when I run fastqs through metacerberus.

I thought it was my reads at first, but I switched to trying to use the example reads (Test_R1.fastq/Test_R2.fastq files) and am still having issues. Currently, it seems like metacerberus runs to completion but skips step 08 (hmmer search)? As a result, the final folder step_10-visualizeData/ is empty except for the presence of a combined/ folder, which has empty data .html and .tsv files.

There does not seem to be any error output for the test samples-- it just prints out the steps with no details and moves on.


STEP 1: Loading sequence files:
Processing 4 fastq sequences
Processing 0 fasta sequences
Processing 0 protein sequences
Processing 0 rollup files
STEP 3: Trimming fastq files
PAIRED: prodigal_Test_R1 /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test_ex/Test_R1.fastq
prodigal_Test
PAIRED: FragGeneScan_Test_R1 /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test_ex/Test_R1.fastq
FragGeneScan_Test
Trimming: prodigal_Test /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out2/step_03-trim/prodigal_Test/merged/Test_merged.fastq
Trimming: FragGeneScan_Test /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out2/step_03-trim/FragGeneScan_Test/merged/Test_merged.fastq
STEP 4: Decontaminating trimmed files
STEP 5b: Reformating FASTQ files to FASTA format
STEP 7: ORF Finder
STEP 8: HMMER Search
STEP 10: Creating Reports
Saving Statistics
Creating Rollup Tables
Creating Count Tables
NOTE: PCA Tables created only when there are at least four sequence files.
NOTE: Pathview created only when there are at least four sequence files.
Creating combined sunburst and bargraphs

I am not sure if it is the same issue or something to do with my reads specifically, but I have a similar problem (I think) when I run metacerberus with my own reads. The majority of my samples get dropped at steps either 08 or 09.

For example, if my input is paired reads for Sample1, Sample2, 3, 4, 5, 6, 7, 8, 9 and 10, and I run it on super mode (prodigal and fragscan), trimming will occur for all sets succesfully. But then at around step08 (hmmer) about half of the FragScan versions of samples are not included in output folders, and then more are dropped in step09 (parsing). By step-10-- the output will include only one or two outputs, ex: Fragscan_Sample11 and Prodigal_Sample3. These will both be included included in the combined output folder. But all other samples are missing

For the runs with multiple samples and drop issues, output has some detail:

STEP 8: HMMER Search
Sending to HMMER:
prodigal_8_Smoot_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/prodigal_8_Smoot_unP/proteins.faa
['KOFam_all-prodigal_2_Reid_unP']
prodigal_6_Lincoln_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/prodigal_6_Lincoln_unP/proteins.faa
['KOFam_all-prodigal_2_Reid_unP', 'KOFam_all-prodigal_2_Reid_unP']
prodigal_3_Fell_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/prodigal_3_Fell_unP/proteins.faa
['KOFam_all-prodigal_2_Reid_unP', 'KOFam_all-prodigal_2_Reid_unP', 'KOFam_all-prodigal_2_Reid_unP']
prodigal_2_Reid_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/prodigal_2_Reid_unP/proteins.faa
['KOFam_all-prodigal_2_Reid_unP', 'KOFam_all-prodigal_2_Reid_unP', 'KOFam_all-prodigal_2_Reid_unP', 'KOFam_all-prodigal_2_Reid_unP']
Sending to HMMER:
prodigal_13_Carnation_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/prodigal_13_Carnation_unP/proteins.faa
['KOFam_all-prodigal_5_Frankfort_unP']
prodigal_7_Mitchell_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/prodigal_7_Mitchell_unP/proteins.faa
['KOFam_all-prodigal_5_Frankfort_unP', 'KOFam_all-prodigal_5_Frankfort_unP']
prodigal_12_KettleFalls_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/prodigal_12_KettleFalls_unP/proteins.faa
['KOFam_all-prodigal_5_Frankfort_unP', 'KOFam_all-prodigal_5_Frankfort_unP', 'KOFam_all-prodigal_5_Frankfort_unP']
prodigal_5_Frankfort_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/prodigal_5_Frankfort_unP/proteins.faa
['KOFam_all-prodigal_5_Frankfort_unP', 'KOFam_all-prodigal_5_Frankfort_unP', 'KOFam_all-prodigal_5_Frankfort_unP', 'KOFam_all-prodigal_5_Frankfort_unP']
Sending to HMMER:
prodigal_9_Hamilton_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/prodigal_9_Hamilton_unP/proteins.faa
['KOFam_all-FragGeneScan_8_Smoot_unP']
FragGeneScan_6_Lincoln_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/FragGeneScan_6_Lincoln_unP/proteins.faa
['KOFam_all-FragGeneScan_8_Smoot_unP', 'KOFam_all-FragGeneScan_8_Smoot_unP']
prodigal_4_Elliston_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/prodigal_4_Elliston_unP/proteins.faa
['KOFam_all-FragGeneScan_8_Smoot_unP', 'KOFam_all-FragGeneScan_8_Smoot_unP', 'KOFam_all-FragGeneScan_8_Smoot_unP']
FragGeneScan_8_Smoot_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/FragGeneScan_8_Smoot_unP/proteins.faa
['KOFam_all-FragGeneScan_8_Smoot_unP', 'KOFam_all-FragGeneScan_8_Smoot_unP', 'KOFam_all-FragGeneScan_8_Smoot_unP', 'KOFam_all-FragGeneScan_8_Smoot_unP']
Sending to HMMER:
FragGeneScan_3_Fell_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/FragGeneScan_3_Fell_unP/proteins.faa
['KOFam_all-FragGeneScan_13_Carnation_unP']
FragGeneScan_7_Mitchell_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/FragGeneScan_7_Mitchell_unP/proteins.faa
['KOFam_all-FragGeneScan_13_Carnation_unP', 'KOFam_all-FragGeneScan_13_Carnation_unP']
FragGeneScan_2_Reid_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/FragGeneScan_2_Reid_unP/proteins.faa
on_unP']

STEP 8: Filtering HMMER results
STEP 9: Parse HMMER results
Sending to HMMER:
prodigal_1_KingGeorge_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/prodigal_1_KingGeorge_unP/proteins.faa
['KOFam_all-FragGeneScan_11_DeerPark_unP']
prodigal_11_DeerPark_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/prodigal_11_DeerPark_unP/proteins.faa
['KOFam_all-FragGeneScan_11_DeerPark_unP', 'KOFam_all-FragGeneScan_11_DeerPark_unP']
FragGeneScan_1_KingGeorge_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/FragGeneScan_1_KingGeorge_unP/proteins.fa
a
['KOFam_all-FragGeneScan_11_DeerPark_unP', 'KOFam_all-FragGeneScan_11_DeerPark_unP', 'KOFam_all-FragGeneScan_11_DeerPark_unP']
FragGeneScan_11_DeerPark_unP /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_07-geneCall/FragGeneScan_11_DeerPark_unP/proteins.faa
['KOFam_all-FragGeneScan_11_DeerPark_unP', 'KOFam_all-FragGeneScan_11_DeerPark_unP', 'KOFam_all-FragGeneScan_11_DeerPark_unP', 'KOFam_all-FragGeneScan_11_De
erPark_unP']
STEP 10: Creating Reports
Saving Statistics
ERROR: Target on line 2 of HMMER file not in protein fasta: /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_08-hmmer/FragGeneScan_1
3_Carnation_unP/filtered.tsv
ERROR: Target on line 371 of HMMER file not in protein fasta: /projects/x/eb2/x/07_func/mega/02_metacerb_fq/test.out/step_08-hmmer/FragGeneScan
_11_DeerPark_unP/filtered.tsv
Creating Rollup Tables
Creating Count Tables
PCA Analysis
Creating combined sunburst and bargraphs
Finished Pipeline

I've tried inputting files both directly in the command and as an exterior config .yaml file. Any ideas?

Happy to send you some of my reads/output if you think that would help, just let me know!

raw-lab commented 6 months ago

What version of MetaCerberus are you using? metacerberus.py -v

raw-lab commented 6 months ago

Also, how long are your reads?

ebueren commented 6 months ago

I have MetaCerberus: version: 1.2.1 February 2024

**Edit, I misread your question, my bad. My individual .fq files range from ~200MB - 800MB with around 200bp for individual reads, based on a quick check of a few samples.

At first glance based on which samples made it to the end, size doesn't seem to be the reason some are being dropped vs. kept, but this is just an eyeball so I will do a little bit of a dig to check.

raw-lab commented 6 months ago

Did they make it to the end? Can you attach some of your reads so we can check? Are they overlapping libraries? What was the insert size?

raw-lab commented 6 months ago

Also, provide the command you used?

raw-lab commented 6 months ago

Is it only the fraggenescan that is dropping out? We think this is a bug that we are fixing in 1.3 version. Is the prodigal still calling orfs?

ebueren commented 6 months ago

Command: metacerberus.py --super test_input/ --illumina --meta --dir_out test.out --cpus 16

Have also tried it with just metacerberus.py -c input.yaml with the same list of files/parameters

The reads are overlapping libraries with an insert size of 500bp.

Both fraggenescan and prodigal are dropping out at step_08.

Looking more closely at the intermediate files in step_08, it looks like the hmmer search is unsuccessful for the dropped samples.

Ex: I ran 12 samples through super, and at step_08 a program_sample folder was made for each run. The ones that were dropped at this stage (didn’t go on to step 09) only have one file — an empty tmp.err files. (ex: “KOFam_all-prodigal_12_KettleFalls_unP_tmp.err” — 0 bytes, is the only file in step_08-hmmer/prodigal_12_KettleFalls_unP)

The other weird thing is that none of my samples have been successful with both fraggenescan and prodigal. So a sample that worked for Fragenescan then did not work with prodigal as an ORF caller or vice versa.

I also noticed I'm having similar issues with annotating any assembled contigs, either made from these reads or any other files that previously worked in 1.1. I set up an env of metacerberus 1.1 and the assembled contig files run smoothly through those.

Attached reads are in a drive folder here: https://drive.google.com/file/d/1QK_8nNKdc5mBkeNcJcHUBrqJxvjb7WSc/view?usp=sharing

In v1.2. my runs have Sample 5 successfully completing with only Prodigal, Sample 8 only FragGeneScan, and Samples 6 and 9 failing both.

raw-lab commented 6 months ago

Hmm, it's quite odd. I took a subsample of the reads (beautiful quality scores by the way). I didn't have the same issues. We will take another look and get back to you as soon as possible. Thank you for your patience also helping us to find bugs in MetaCerberus.

What type of samples are they? Are they full of eukaryotes or protists? Or a plant genome?

We will get you up and running to results asap.

many thanks, RAW lab

ebueren commented 6 months ago

They are virome reads, but they haven't been filtered except for host contamination (bees)-- so if there is a high amount of bacterial or non-viral hits that is probably why!

Hmmm, I will do a fresh install on my end and see if that fixes it using the following:

conda create -n metacerberus -c conda-forge -c bioconda metacerberus -y
conda activate metacerberus
metacerberus.py --setup

Will let you know if anything changes on my end!

raw-lab commented 6 months ago

If you do the metacerberus.py --setup that should be good.

We are working on 1.3 which have more databases and a major update.

Filteration of host sequences is quite challenging. We are working on another tool for that.

It should havent them drop out. Could it be a memory issue?

We will work with the data you gave us to see if we can fix it. Thank you for your patience.

Many thanks, RAW lab

ebueren commented 6 months ago

So I reran it in a fresh environment in case my initial install went awry somewhere. The output I got still had drops but was different. This time FragGeneScan worked for Samples 5 and 6, and both 8 and 9 failed FragGeneScan and Prodigal.

It seems possible this might be a memory issue. This is my first time using Ray, is it possible I'm using it incorrectly? This is the sbatch I'm submitting, for context:


#!/usr/bin/env bash
#SBATCH --nodes=3
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=16

cd /projects/x/eb2/x/07_func/mega/02_metacerb_fq

source activate /home/ebueren/miniconda3/envs/metacerberusNEW

source ray-slurm-metacerberus.sh

metacerberus.py --super test_reads/ --illumina --meta --dir_out raw.out --cpus 16

I tried running with a request for more CPUs per task (below), and this time it ran faster (before getting to the hmmer step) but dropped all the runs at the hmmer stage. So there was no sucessful output.


#!/usr/bin/env bash
#SBATCH --nodes=3
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=64

cd /projects/x/eb2/x/07_func/mega/02_metacerb_fq

source activate /home/ebueren/miniconda3/envs/metacerberusNEW

source ray-slurm-metacerberus.sh

metacerberus.py --super test_reads/ --illumina --meta --dir_out raw.out --cpus 64
raw-lab commented 5 months ago

Good afternoon,

Thank you for you patience - 1.3 is loading to conda currently. The 1.3 version will fix this bug.

Thank you for using MetaCerberus!

many thanks, Rick