hecatomb's same output files from the same dataset are with different sizes

Leran10 commented 3 years ago

Hi,

Me and My colleague are running hecatomb on the same dataset. But the same output tables generated from us are with different rows or different number of sequences:

mine:

wc -l bigtable.tsv
**33073** bigtable.tsv

seqkit stat seqtable.fasta
file            format  type  num_seqs     sum_len  min_len  avg_len  max_len
seqtable.fasta  FASTA   DNA    **124,014**  28,329,286       90    228.4      250

My colleague:

wc -l assembly.fasta
**23998** assembly.fasta

seqkit stat seqtable.fasta
file            format  type  num_seqs     sum_len  min_len  avg_len  max_len
seqtable.fasta  FASTA   DNA    **124,041**  28,332,842       90    228.4      250

We are not sure if these steps have finished or not. Because we both failed at the "sankey_diagram" step.

I wanted to checked .err files in the LOG folder to see if the steps that generated those files were finished or not, but couldn't make sure which folders are the right ones to go.

I think If there could be a final .log file says "The entire hecatomb pipeline has successfully finished! " generated after the whole pipeline is done, It'll be very helpful.

Thanks! Leran

beardymcjohnface commented 3 years ago

Hi Leran, The crash I think relates to #34 and the logging should be addressed in the next build. Let's revisit it once the pipeline is finishing correctly.

mihinduk commented 3 years ago

Some additional data: 134 samples run on Pathogen: assembly.fasta:24280 seqtable.fasta: 35,306,586

Same 134 samples running on HTCF (MMSEQS_AA_PRIMARY step): assembly.fasta:24568 seqtable.fasta: 9,344,202

I don't think the second seqtable will be added to after this step

mihinduk commented 3 years ago

It seems like the difference is mainly arising during clustering:

Pathogen: more p09_cluster_similar_sequences.M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10.log Size of the sequence database: 341461 Size of the alignment database: 341461 Number of clusters: 325347

more p08_remove_exact_dups.M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10.log ***** Input: 971106 reads 223037112 bases. Duplicates: 65529 reads (6.75%) 15441315 bases (6.92%) 0 collisions. Result: 905577 reads (93.25%) 207595797 bases (93.08%)

HTCF (slurm): more p09_cluster_similar_sequences.M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10.log *** Size of the sequence database: 147516 Size of the alignment database: 147516 Number of clusters: 77134**

more p08_remove_exact_dups.M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10.log Input: 971504 reads 223125309 bases. Duplicates: 65541 reads (6.75%) 15444031 bases (6.92%) 0 collisions. Result: 905963 reads (93.25%) 207681278 bases (93.08%)

shandley commented 3 years ago

This very well may be due to some changes we made to the clustering parameters for linclust. There was some toying with these settings over the past month, so if you ran the Pathogen run a while back and the HTCF run more recently you would likely get different results.

Can you check the cluster settings for each run (should be in your config file, but may be in the rule file depending on which version and when Mike made updates). That large of a difference is more likely explained by a major setting change than a compute error.

mihinduk commented 3 years ago

Here are the differences: Pathogen: mmseqs easy-linclust hecatomb_out/PROCESSING/TMP/p08/M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10_R1.deduped.out.fastq hecatomb_out/PROCESSING/TMP/p09/M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10_R1 hecatomb_out/PROCESSING/TMP/p09/M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10_TMP --kmer-per-seq-scale 0.3 -c 0.97 --cov-mode 1 --threads 16 &> hecatomb_out/STDERR/p09_cluster_similar_sequences.M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10.log;

HTCF: mmseqs easy-linclust hecatomb_out/PROCESSING/TMP/p08/M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10_R1.deduped.out.fastq hecatomb_out/PROCESSING/TMP/p09/M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10_R1 hecatomb_out/PROCESSING/TMP/p09/M721_I9060_34 044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10_TMP --kmer-per-seq-scale 0.3 -c 0.7 --cov-mode 1 --min-seq-id 0.95 --alignment-mode 3 --threads 24 &> hecatomb_out/STDERR/p09_cluster_similar_sequences.M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10.log;

mihinduk commented 3 years ago

NOTE: -c = minimum coverage originally 0.97 to 0.7 --min-seq-id + --alignment-mode 3 (number of identical aligned residues divided by the number of aligned columns including internal gap columns) -> now 0.95

not clear to me what happens without these params

shandley commented 3 years ago

Well dropping from 0.97 to 0.7 is going to have the greatest effect and is likely way lower than what we intend. This should be set back to 0.97 or 0.95.

You can read about alignment mode here: https://github.com/soedinglab/MMseqs2/wiki#how-to-set-the-right-alignment-coverage-to-cluster

@mroach-awri can we go back to my original settings as the default for now?

beardymcjohnface commented 3 years ago

Yes I think I pushed my settings to the last build. The original default (-c .97 --cov-mode 0) I think specifies 97% residue matches in the longer sequence, which will only cluster sequences that are end-to-end almost identical and most clusters are n=1. The current settings (--cov-mode 1 -c 0.7 --min-seq-id 0.95) specify 70% alignment coverage of the member sequence by the rep sequence at 95% identity. These are the setting I'll probably be using as I want to maximize runtime performance.

I'll be pushing a new build soon; did you want the original defaults or did you have a specific coverage and seq identity in mind?

mihinduk commented 3 years ago

Hi Mike,

We need the original settings. Could the faster settings be a switch?

Kathie

From: beardymcjohnface @.> Reply-To: shandley/hecatomb @.> Date: Saturday, October 30, 2021 at 5:48 PM To: shandley/hecatomb @.> Cc: "Mihindukulasuriya, Kathie" @.>, Comment @.***> Subject: Re: [shandley/hecatomb] hecatomb's same output files from the same dataset are with different sizes (Issue #35)

External Email - Caution *

Yes I think I pushed my settings to the last build. The original default (-c .97 --cov-mode 0) I think specifies 97% residue matches in the longer sequence, which will only cluster sequences that are end-to-end almost identical and most clusters are n=1. The current settings (--cov-mode 1 -c 0.7 --min-seq-id 0.95) specify 70% alignment coverage of the member sequence by the rep sequence at 95% identity. These are the setting I'll probably be using as I want to maximize runtime performance.

I'll be pushing a new build soon; did you want the original defaults or did you have a specific coverage and seq identity in mind?

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/shandley/hecatomb/issues/35#issuecomment-955603827, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANDVLDIRETLF5YUQWZSUV5LUJRY2LANCNFSM5G5YEEEQ. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail.

Leran10 commented 3 years ago

Running: grep \> seqtable.fasta | sed 's/.*:$.*$:.*/\1/' | awk '{n+=$1}END{print n}' gives consistent number of reads on seqtable.fasta file.

Leran

shandley / hecatomb

hecatomb's same output files from the same dataset are with different sizes #35