Open mariaelf97 opened 7 months ago
Dear Maryam,
Thank you for your interest in SWAMPy! Yes, there are parts of the code that use SARS-CoV-2 genome as default. Which organism do you have in your mind? Genome size is an important consideration for this kind of modification. Also, the code is not structured to accommodate multiple chromosomes, if that is applicable. If your organism has a single chromosome and a relatively small genome size, it should be possible and we can give you pointers on how to do it.
All the best, Rabia
Dear @rabiafidan Thank you for your response. The pathogen of interest is a bacteria with 1 chromosome. The length is approximately 4 mbps.
Regards, Maryam
Dear Maryam,
Sorry for the late response. Sounds like you can go ahead and try it! Just remember, SWAMPy is intended to simulate whole genome amplicon sequencing with overlapping amplicons on an Illumina machine. If you have a different construct, it is also a consideration.
The step that uses SARS-CoV-2 genome as default is adding the PCR error.
You will see add_PCR_errors
function (from PCR_error.py) takes WUHAN_REF argument. This argument is defined in the main script simulate_metagenome.py line 38.
Hopefully, it should work. Let me know if you are unsure about how to do any of the steps and how it goes!
I have just noticed a point that might not have been clear. The reference genome you want to add to the ref folder is the ancestral strain (equivalent to the Wuhan variant of SAS-CoV-2), not necessarily the strain you want to simulate. This is to ensure PCR errors are consistent across different variants of the bacterium in your sample.
Hope it makes sense!
Hi, Thank you for the information. I proceeded as you instructed and I am getting the following error:
python simulate_metagenome.py --genomes_file all_seqs.fasta --temp_folder temp --genome_abundances .abundances.tsv --primer_set a1 --output_folder output --read_length 400 --seed 10 --no_pcr_error```
The first error I got was:
import pandas as pd
2024-02-21 10:48:25,444 [INFO] Primer set: Artic v1
2024-02-21 10:48:25,444 [INFO] Number of reads: 100000
2024-02-21 10:48:25,445 [INFO] Random seed: 10
2024-02-21 10:48:25,445 [INFO] Amplicon pseudocounts/ i.e. quality parameter: 200
2024-02-21 10:48:25,488 [INFO] Spaces in the genomes and abundances files are processed as '_' characters if exist
2024-02-21 10:48:25,489 [INFO] Total of relative abundance values is 100.0, not 1.
2024-02-21 10:48:25,489 [INFO] Continuing, normalising total of genome abundances to 1.
2024-02-21 10:48:25,616 [INFO] Working on genome 1 of 4
2024-02-21 10:48:25,617 [INFO] Using bowtie2 to align primers to genome MTB-Oman-321769-L1
Building a SMALL index
Renaming temp/indices/1.3.bt2.tmp to temp/indices/L1.3.bt2
Renaming temp/indices/L1.4.bt2.tmp to temp/indices/L1.4.bt2
Renaming temp/indices/L1.1.bt2.tmp to temp/indices/L1.1.bt2
Renaming temp/indices/L1.2.bt2.tmp to temp/indices/L1.2.bt2
Renaming temp/indices/L1.rev.1.bt2.tmp to temp/indices/L1.rev.1.bt2
Renaming temp/indices/L1.rev.2.bt2.tmp to temp/indices/L1.rev.2.bt2
Traceback (most recent call last):
File "simulate_metagenome.py", line 344, in
File "SWAMPy/src/create_amplicons.py", line 26, in align_primers
df = pd.read_csv(alignment, sep="\t", skiprows=[0, 1, 2], header=None, names=[i for i in range(19)])
File "envs/swampy/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1024, in read_csv
return _read(filepath_or_buffer, kwds)
File "envs/swampy/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 624, in _read
return parser.read(nrows)
File "envs/swampy/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1921, in read
) = self._engine.read( # type: ignore[attr-defined]
File "envs/swampy/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
chunks = self._reader.read_low_memory(nrows)
File "parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory
File "parsers.pyx", line 905, in pandas._libs.parsers.TextReader._read_rows
File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows
File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 19 fields in line 114, saw 20
I fixed it by changing line 26 in create_amplicons
`df = pd.read_csv(alignment, sep="\t", skiprows=[0, 1, 2], header=None, names=[i for i in range(20)])`
Then I got another error
Traceback (most recent call last):
File "SWAMPy/src/simulate_metagenome.py", line 365, in
Is this because I did not provide the amplicon_distribution file?
It seems that the amplicon fasta files were created in the temp folder.
Is this normal to have empty amplicon.fasta files?
all_seqs.fasta has 4 contigs, each size of about 4 MBP genome_abundances look like this L1 30 L2 40 L3 20 L4 10 primer file is the same as the format you provided in fastq and bed format
Dear Maryam,
Not only the amplicon distribution file but also other steps are necessary to add your custom amplicon set. Please carefully follow #2 Empty amlicon files is not normal. I see that you ran your command with --primer_set a1
, which is artic v1 primer set. That wouldn't align to your organism, hence no amplicons. You should use your custom amplicon set for your organism.
I did follow the instruction and replaced the files with my amplicons
Then I don't know what the problem is, sorry! I can only suggest the obvious, debugging :) You got
pandas.errors.ParserError: Error tokenizing data. C error: Expected 19 fields in line 114, saw 20
so you can start by looking at line 114, how is it different from line 113, why is it 20 fields...
Dear authors!
Good day to you. I was wondering how I can use SWAMPy for other pathogens? Which parts of the code should I modify? I have different reference genomes and primer sets. Is there any parts of the code that uses Covid reference genome as default?
Regards, Maryam