nextstrain / seasonal-flu

Scripts, config, and Snakefiles for seasonal-flu Nextstrain builds

Running issue nextstrain pipeline #149

Open sekhwal opened 6 months ago

sekhwal commented 6 months ago

I am trying to run the Nextstrain pipeline. I followed the steps below, but I am getting an error.

Step 1. I downloaded the seasonal-flu repo, created a directory called "data/h3n2", and placed the FASTA files there: 1. raw_sequences_ha.fasta (downloaded from the GISAID database) and 2. raw_ha.fasta (my query H3N2 HA FASTA file).

Step 2. Set up the "seasonal-flu/profiles/gisaid/builds.yaml" file.

Step 3. Installed the Nextstrain CLI with Conda and ran the following command.

nextstrain build . --configfile profiles/gisaid/builds.yaml --use-conda --conda-frontend mamba

I do not have a lat-longs file ("config/lat_longs.tsv"), so I am using the default, and I do not have a metadata file in the data/h3n2 folder.

error

    Building DAG of jobs...
    /home/mmk6053/.nextstrain/runtimes/conda/env/bin/bash: line 1: conda: command not found
    Traceback (most recent call last):
      File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/__init__.py", line 792, in snakemake
        success = workflow.execute(
      File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/workflow.py", line 1078, in execute
        dag.create_conda_envs(
      File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/dag.py", line 359, in create_conda_envs
        env.create(dryrun)
      File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 393, in create
        pin_file = self.pin_file
      File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/common/__init__.py", line 217, in __get__
        value = self.method(instance)
      File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 103, in pin_file
        f".{self.conda.platform}.pin.txt"
      File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/common/__init__.py", line 217, in __get__
        value = self.method(instance)
      File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 96, in conda
        return Conda(
      File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 653, in __init__
        shell.check_output(self._get_cmd("conda info --json"), text=True)
      File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/shell.py", line 61, in check_output
        return sp.check_output(cmd, shell=True, executable=executable, **kwargs)
      File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/subprocess.py", line 421, in check_output
        return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
      File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/subprocess.py", line 526, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command 'conda info --json' returned non-zero exit status 127.

builds.yaml

    custom_rules:
    metadata_fields:
    renamed_metadata_fields:
    lat-longs: "config/lat_longs.tsv"
    segments:
    submission_date_field: date
    recency:
      date_bins: [7, 30, 90]
      date_bin_labels: ["last week", "last month", "last quarter"]
      upper_bin_label: older
    builds:
      "h3n2":
        lineage: h3n2
        reference: "config/h3n2/{segment}/reference.fasta"
        annotation: "config/h3n2/{segment}/genemap.gff"
        tree_exclude_sites: "config/h3n2/{segment}/exclude-sites.txt"
        clades: "config/h3n2/ha/clades.tsv"
        subclades: "config/h3n2/ha/subclades.tsv"
        auspice_config: "config/h3n2/auspice_config.json"
        enable_lbi: true
        enable_glycosylation: true
        subsamples:
          global:
            filters: "--group-by region year month --subsample-max-sequences 100"

joverlee521 commented 6 months ago

Sorry you ran into this error @sekhwal! Looks like we need to update the instructions in the Quickstart with GISAID data.

The --use-conda flag should not be used with the Nextstrain managed environments. As @victorlin previously stated in a separate issue:

--use-conda falls under the advanced use case of managing Conda environments, which we recommend the ambient runtime for.

Please try running the workflow again with:

nextstrain build . --configfile profiles/gisaid/builds.yaml
sekhwal commented 6 months ago

When I run the following commands in the "seasonal-flu" repo, the workflow runs and generates files in the auspice folder. But it does not seem to include the FASTA and metadata files that I provided. Please let me know how to run the pipeline with real data that I downloaded from GISAID plus one sequence of my own.

I am running the Nextstrain pipeline in a Conda environment ("conda activate clade_env"). I followed all the steps at https://github.com/nextstrain/seasonal-flu, created the folder data/h3n2, and added metadata.xls and raw_sequences_ha.fasta.

nextstrain shell .

nextstrain build . --configfile profiles/gisaid/builds.yaml

nextstrain view auspice/

joverlee521 commented 6 months ago

Just to clarify, did you add your own sequence to metadata.xls and raw_sequences_ha.fasta? Or is it in separate files? The workflow currently does not support multiple inputs, so you would have to manually add your own data to the GISAID data.

sekhwal commented 6 months ago

Yes, I added my own sequence information to the metadata.xls that I downloaded from GISAID, added the FASTA sequence to raw_sequences_ha.fasta, and placed these files in the data/h3n2 folder. When I run the pipeline, it generates "ha.fasta" and metadata.tsv in the data/h3n2 folder, plus the "auspice", "benchmarks", "builds", and "logs" directories. Then I run "nextstrain view auspice/" and it displays the phylogenetic tree, but I cannot find my own sequence in the tree. If possible, can I email you the metadata.xls file and raw_sequences_ha.fasta?

sekhwal commented 6 months ago

Do you think my sequence belongs to clade "3C.2a1b.2a.2b", and since none of the other sequences I downloaded from GISAID are from this clade, that is why it does not appear in the tree?

joverlee521 commented 6 months ago

Ah, I see. Your sequences may be randomly removed during subsampling.

Subsampling parameters are defined in the profiles/gisaid/builds.yaml file: https://github.com/nextstrain/seasonal-flu/blob/13a42114091747c422f61875a028b395289fed4f/profiles/gisaid/builds.yaml#L45-L47

These parameters are being used by augur filter within the workflow. You can read more about augur filter to determine if you want to change the subsampling parameters.


For your specific case of wanting to include your own sequences in the tree, it is possible to force sequences to be included regardless of subsampling parameters by using the --include flag.

  1. Create a text file that lists the sequences that you want to force include, with a single strain name per line. This file must be added within the seasonal-flu directory, for example you can add it to profiles/gisaid/include.txt.
  2. Edit the parameters in the profiles/gisaid/builds.yaml file to use your include.txt file.
    filters: "--group-by region year month --subsample-max-sequences 100 --include profiles/gisaid/include.txt" 
  3. Re-run the workflow with the --forceall flag.
    nextstrain build . --configfile profiles/gisaid/builds.yaml --forceall
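
For reference, include.txt is just a plain text file of strain names, one per line, matching the strain names used in the metadata. A hypothetical example (these strain names are placeholders, not real data):

    A/ExampleCity/01/2022
    A/ExampleCity/02/2022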
sekhwal commented 6 months ago

Just a little more clarification: what should I put in include.txt, the FASTA sequence from my own file or the ID of my sequence as I added it to raw_sequences_ha.fasta and metadata.xls? Is there a template for include.txt?

joverlee521 commented 6 months ago

The include.txt should have a single sequence ID per line, matching the sequence ID used in the metadata and FASTA file.

sekhwal commented 6 months ago

I added a single seq ID "EPI_ISL_170" in include.txt and updated the builds.yaml

filters: "--group-by region year month --subsample-max-sequences 1000 --include profiles/gisaid/include.txt"

However, it is not showing in the tree.

This is my sequence header:

A/Pennsylvania/01/2022 | EPI_ISL_170 | A / H3N2 | |3C.2a1b.2a.2b | | 2022-01-01

metadata file screenshot

Screenshot from 2024-02-16 16-14-42

sekhwal commented 6 months ago

Here is the clades_h3n2.txt log file. I think there might be a clade-mismatch issue?

clades_h3n2.txt

joverlee521 commented 6 months ago

I added a single seq ID "EPI_ISL_170" in include.txt and updated the builds.yaml

Sorry, when I said the sequence ID, I mean the first part of the FASTA header. You should add the strain name "A/Pennsylvania/01/2022" to include.txt.
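
To avoid copying strain names by hand, you can cut the name out of each GISAID-style header programmatically. A minimal sketch, assuming headers shaped like a simplified version of the one above (the strain name is everything before the first "|", with surrounding spaces trimmed):

```shell
# Strain name = text before the first "|" in the header, trailing spaces removed.
# The header below is a simplified version of the example in this thread.
header='A/Pennsylvania/01/2022 | EPI_ISL_170 | A / H3N2 | 2022-01-01'
strain=$(printf '%s\n' "$header" | sed 's/ *|.*$//')
printf '%s\n' "$strain"   # -> A/Pennsylvania/01/2022
```

The same substitution can be applied to every header line of a FASTA, e.g. `grep '^>' data/h3n2/raw_sequences_ha.fasta | sed -e 's/^>//' -e 's/ *|.*$//' > profiles/gisaid/include.txt` (paths taken from this thread; adjust them for your setup).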

sekhwal commented 6 months ago

Sorry, even if I add this strain name "A/Pennsylvania/01/2022" to include.txt, it shows the following error.

error

    [Fri Feb 16 16:54:04 2024]
    Error in rule annotate_epiweeks:
        jobid: 17
        input: builds/h3n2/metadata.tsv
        output: builds/h3n2/epiweeks.json
        log: logs/annotate_epiweeks_h3n2.txt (check log file(s) for error details)
        conda-env: /home/mmk6053/Dropbox/influenzadata/seasonal-flu/.snakemake/conda/b2dc7dab0f55bf6c5699067912b18fca
        shell:
            python3 scripts/calculate_epiweek.py --metadata builds/h3n2/metadata.tsv --output-node-data builds/h3n2/epiweeks.json 2>&1 | tee logs/annotate_epiweeks_h3n2.txt
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

    [Fri Feb 16 16:54:04 2024] Finished job 18.
    5 of 26 steps (19%) done
    [Fri Feb 16 16:54:04 2024] Finished job 9.
    6 of 26 steps (23%) done
    Shutting down, this might take some time.
    Exiting because a job execution failed. Look above for error message
    Complete log: .snakemake/log/2024-02-16T165400.558340.snakemake.log

joverlee521 commented 6 months ago

There should be more error details in the log file. Can you check the logs/annotate_epiweeks_h3n2.txt file?

sekhwal commented 6 months ago

I tried running the pipeline in different ways, but it does not include my sequence in the tree. Also, when I filter the GISAID-downloaded sequences, for example keeping one sequence per clade to reduce the dataset size, the pipeline does not accept the reformatted file and generates an error. I think it is strictly designed to run on GISAID-formatted data only.

I will use multiple sequence alignment software such as MAFFT and then run TreeTime and Auspice.

joverlee521 commented 6 months ago

It was nice to meet you in office hours today @sekhwal! In case you lose the link that @huddlej had shared during office hours, here's the example of the more complex subsampling scheme:

https://github.com/nextstrain/seasonal-flu/blob/d439f897f2819961dee431f92ea4e6f45343688a/profiles/nextstrain-public.yaml#L78-L86

sekhwal commented 6 months ago

Thank you for sharing the subsampling scheme; I would like to discuss it in our meeting next week to understand it better. When I download data from 2018 to 2023 from GISAID, it is a very large dataset. How can I use the subsampling scheme to filter the data down to a reasonable number of sequences for building a tree?

sekhwal commented 6 months ago

Also, I tried running the seasonal-flu repo with include.txt, but I am still not able to find my sequence in the analysis. I am using filters: "--group-by region year month --subsample-max-sequences 3300 --include profiles/gisaid/include.txt" in builds.yaml.

huddlej commented 6 months ago

@sekhwal and I met today to walk through the workflow and figure out where his local sequence was getting dropped. We found that it was dropped at the "flag outliers" step because its collection date was later than expected by the defined clock rate. We added an include entry to the GISAID build config to keep his strain from getting added to those outliers. After this change, the strain got pruned by augur refine's clock filter. We figured that the sequence was dropped because it was a 2022 sequence in a tree with sequences collected up to December 2020, so the sequence would appear to be an outlier. We also realized that the collection date had an ambiguous month and day, so we modified the starting metadata to have a collection date of 2022-XX-XX. This change should force augur refine to infer the likely date of the sequence from the rest of the sequences in the tree.
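
The metadata change described above can be scripted as a one-line substitution. A sketch with inline demo data (the column layout here is an assumption; check your metadata.tsv before editing the real file):

```shell
# Rewrite the date column for one strain from an exact date to an ambiguous
# one (2022-XX-XX) so that augur refine infers it from the rest of the tree.
# The demo rows below stand in for the real data/h3n2/metadata.tsv.
printf 'strain\tdate\nA/Pennsylvania/01/2022\t2022-01-01\n' \
  | sed '/^A\/Pennsylvania\/01\/2022/ s/2022-01-01/2022-XX-XX/'
```

On the real file you would run the same sed in place with a backup, e.g. `sed -i.bak '...' data/h3n2/metadata.tsv`.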

@sekhwal is going to try the analysis again with more sequences from 2021 and 2022 so his sequence has more appropriate temporal context. Still, we may want to add an option to the workflow that allows users to disable the clock filter in augur refine.