goldman-gp-ebi / SWAMPy

Other
10 stars 5 forks source link

Using SWAMPy for other pathogens #7

Open mariaelf97 opened 7 months ago

mariaelf97 commented 7 months ago

Dear authors!

Good day to you. I was wondering how I can use SWAMPy for other pathogens? Which parts of the code should I modify? I have different reference genomes and primer sets. Is there any parts of the code that uses Covid reference genome as default?

Regards, Maryam

rabiafidan commented 7 months ago

Dear Maryam,

Thank you for your interest in SWAMPy! Yes, there are parts of the code that use SARS-CoV-2 genome as default. Which organism do you have in your mind? Genome size is an important consideration for this kind of modification. Also, the code is not structured to accommodate multiple chromosomes, if that is applicable. If your organism has a single chromosome and a relatively small genome size, it should be possible and we can give you pointers on how to do it.

All the best, Rabia

mariaelf97 commented 7 months ago

Dear @rabiafidan Thank you for your response. The pathogen of interest is a bacteria with 1 chromosome. The length is approximately 4 mbps.

Regards, Maryam

rabiafidan commented 6 months ago

Dear Maryam,

Sorry for the late response. Sounds like you can go ahead and try it! Just remember, SWAMPy is intended to simulate whole genome amplicon sequencing with overlapping amplicons on an Illumina machine. If you have a different construct, it is also a consideration.

The step that uses SARS-CoV-2 genome as default is adding the PCR error.

You will see add_PCR_errors function (from PCR_error.py) takes WUHAN_REF argument. This argument is defined in the main script simulate_metagenome.py line 38.

  1. Firstly, you would need to add the reference files into the ref directory. You should add the FASTA file of your bacterium and the bowtie2 indices (the ones that have bt2 extension).
  2. You will need to modify line 38 at simulate_metagenome.py. Change "MN908947.3" with your FASTA file name, without the .fasta or .fa extension.
  3. Change the chromosome name "MN908947.3" at line 285 of PCR_error.py with the chromosome name of your bacterium.
  4. Then follow the steps of issue #2 to add your suitable amplicon sets.

Hopefully, it should work. Let me know if you are unsure about how to do any of the steps and how it goes!

rabiafidan commented 6 months ago

I have just noticed a point that might not have been clear. The reference genome you want to add to the ref folder is the ancestral strain (equivalent to the Wuhan variant of SAS-CoV-2), not necessarily the strain you want to simulate. This is to ensure PCR errors are consistent across different variants of the bacterium in your sample.

Hope it makes sense!

mariaelf97 commented 6 months ago

Hi, Thank you for the information. I proceeded as you instructed and I am getting the following error:

python simulate_metagenome.py --genomes_file all_seqs.fasta --temp_folder temp --genome_abundances .abundances.tsv --primer_set a1 --output_folder output --read_length 400 --seed 10 --no_pcr_error```

The first error I got was:

import pandas as pd 2024-02-21 10:48:25,444 [INFO] Primer set: Artic v1 2024-02-21 10:48:25,444 [INFO] Number of reads: 100000 2024-02-21 10:48:25,445 [INFO] Random seed: 10 2024-02-21 10:48:25,445 [INFO] Amplicon pseudocounts/ i.e. quality parameter: 200 2024-02-21 10:48:25,488 [INFO] Spaces in the genomes and abundances files are processed as '_' characters if exist 2024-02-21 10:48:25,489 [INFO] Total of relative abundance values is 100.0, not 1. 2024-02-21 10:48:25,489 [INFO] Continuing, normalising total of genome abundances to 1. 2024-02-21 10:48:25,616 [INFO] Working on genome 1 of 4 2024-02-21 10:48:25,617 [INFO] Using bowtie2 to align primers to genome MTB-Oman-321769-L1 Building a SMALL index Renaming temp/indices/1.3.bt2.tmp to temp/indices/L1.3.bt2 Renaming temp/indices/L1.4.bt2.tmp to temp/indices/L1.4.bt2 Renaming temp/indices/L1.1.bt2.tmp to temp/indices/L1.1.bt2 Renaming temp/indices/L1.2.bt2.tmp to temp/indices/L1.2.bt2 Renaming temp/indices/L1.rev.1.bt2.tmp to temp/indices/L1.rev.1.bt2 Renaming temp/indices/L1.rev.2.bt2.tmp to temp/indices/L1.rev.2.bt2 Traceback (most recent call last): File "simulate_metagenome.py", line 344, in df = align_primers(genome_path, genome_filename_short, INDICES_FOLDER, PRIMERS_FILE, False)
File "SWAMPy/src/create_amplicons.py", line 26, in align_primers df = pd.read_csv(alignment, sep="\t", skiprows=[0, 1, 2], header=None, names=[i for i in range(19)]) File "envs/swampy/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1024, in read_csv return _read(filepath_or_buffer, kwds) File "envs/swampy/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 624, in _read return parser.read(nrows) File "envs/swampy/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1921, in read ) = self._engine.read( # type: ignore[attr-defined] File "envs/swampy/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read chunks = self._reader.read_low_memory(nrows) File "parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory File "parsers.pyx", line 905, in pandas._libs.parsers.TextReader._read_rows File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error pandas.errors.ParserError: Error tokenizing data. C error: Expected 19 fields in line 114, saw 20

I fixed it by changing line 26 in create_amplicons 
`df = pd.read_csv(alignment, sep="\t", skiprows=[0, 1, 2], header=None, names=[i for i in range(20)])`

Then I got another error

Traceback (most recent call last): File "SWAMPy/src/simulate_metagenome.py", line 365, in df_amplicons["hyperparameter"] = df_amplicons.apply(amplicon_hyperparameter_sampler, axis=1) File "envs/swampy/lib/python3.10/site-packages/pandas/core/frame.py", line 10347, in apply return op.apply().finalize(self, method="apply") File "envs/swampy/lib/python3.10/site-packages/pandas/core/apply.py", line 916, in apply return self.apply_standard() File "envs/swampy/lib/python3.10/site-packages/pandas/core/apply.py", line 1063, in apply_standard results, res_index = self.apply_series_generator() File "envs/swampy/lib/python3.10/site-packages/pandas/core/apply.py", line 1081, in apply_series_generator results[i] = self.func(v, *self.args, **self.kwargs) File "SWAMPy/src/read_model.py", line 45, in hyperparam_sampler return hyperparams[dataframe_row.amplicon_number] KeyError: 0



Is this because I did not provide the amplicon_distribution file?
It seems that the amplicon fasta files were created in the temp folder.
mariaelf97 commented 6 months ago

Is this normal to have empty amplicon.fasta files?

mariaelf97 commented 6 months ago

all_seqs.fasta has 4 contigs, each size of about 4 MBP genome_abundances look like this L1 30 L2 40 L3 20 L4 10 primer file is the same as the format you provided in fastq and bed format

rabiafidan commented 6 months ago

Dear Maryam,

Not only the amplicon distribution file but also other steps are necessary to add your custom amplicon set. Please carefully follow #2 Empty amlicon files is not normal. I see that you ran your command with --primer_set a1, which is artic v1 primer set. That wouldn't align to your organism, hence no amplicons. You should use your custom amplicon set for your organism.

mariaelf97 commented 6 months ago

I did follow the instruction and replaced the files with my amplicons

rabiafidan commented 6 months ago

Then I don't know what the problem is, sorry! I can only suggest the obvious, debugging :) You got pandas.errors.ParserError: Error tokenizing data. C error: Expected 19 fields in line 114, saw 20 so you can start by looking at line 114, how is it different from line 113, why is it 20 fields...