goldman-gp-ebi / SWAMPy

Other
10 stars 5 forks source link

Illumina error profile is flipped? #8

Open hoelzer opened 1 month ago

hoelzer commented 1 month ago

Hey,

thanks for the great simulation tool! We simulated reads using this command:

conda activate SWAMPy
python simulate_metagenome.py \
    --genomes_file ../example/viralref.fasta \
    --temp_folder ../example/${sample}_temp \
    --genome_abundances ../example/${sample}.tsv \
    --primer_set a4 \
    --output_folder ../eebg24_highq \
    --output_filename_prefix ${sample} \
    --n_reads  1000000 \
    --seqSys MSv3 \
    --read_length 250 \
    --seed 10 \
    --amplicon_distribution  dirichlet_1 \
    --amplicon_pseudocounts 200 \
    --unique_insertion_rate 0.00002 \
    --unique_deletion_rate 0.000115 \
    --unique_substitution_rate 0.002485 \
    --recurrent_insertion_rate 0.00002 \
    --recurrent_deletion_rate 0 \
    --recurrent_substitution_rate 0.003357 \
    --max_insertion_length 14 \
    --subs_VAF_alpha 0.36,0.27 \
    --del_VAF_alpha 0.59,0.42 \
    --ins_VAF_alpha 0.33,0.45 \
    --r_subs_VAF_alpha 0.36,0.27 \
    --r_del_VAF_alpha 0.59,0.42 \
    --r_ins_VAF_alpha 0.33,0.45 

but when we check the quality pattern with FASTQC we see such profiles for the paired-end reads:

Forward read

Screenshot 2024-07-10 at 18 04 31

Reverse read

Screenshot 2024-07-10 at 18 05 28

Usually, the read quality drops towards the 5' end and not the 3' end (see https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). So it looks flipped? Although the general error profile looks quite good otherwise!

Did we do something wrong in the simulation, or is this a bug?

(we see the same pattern for all reads we simulated)

Thanks!

rabiafidan commented 1 month ago

Hi! Thank you for reporting this! Have you also done any art_illumina simulation by any chance? Are they also flipped? Since Swampy uses art_illumina amplicon mode for sequencing bit, there is a chance that the bug is inherited from them. Otherwise, it is about how we handle the reads later on and we need to investigate further.

willboulton commented 1 month ago

Hi - thanks for your very clear report and investigation! I think this is a bug that might be inherited from art_illumina (which we're using to simulate read errors) - I'll need to investigate this a bit more just to confirm.

One short term fix - though not ideal - might be to use --seqSys MSv1 (i.e. the ART built-in profile for MiSeq version 1) - though this would have higher error rates than the version 3, which just from inspection seemed to be the right way around. If you thought those read error rates were too high, you could also pair this with using --qshift 10 to bump up the quality scores to be more comparible to the MiSeq v3.

I'll take a look at a longer term fix as well - which would be just to pass in custom (reversed) quality score profiles used by ART - and see if I can confirm that this is indeed a bug with the authors of ART, though that might take a little longer.

hoelzer commented 1 month ago

Hey, thanks for the fast responses and the potential workaround! Yes, we also think that's a problem in art_illumina. We ran

art_illumina --amplicon --quiet --paired --rndSeed 10 --noALN --maskN 0 --seqSys MSv3 --in genomes.fasta --len 250 --rcount 1000000 --out out_prefix

which results in:

Screenshot 2024-07-12 at 12 49 48

I'll take a look at a longer term fix as well - which would be just to pass in custom (reversed) quality score profiles used by ART - and see if I can confirm that this is indeed a bug with the authors of ART, though that might take a little longer.

Great, thanks!

willboulton commented 1 month ago

I've contacted the authors of ART to see if I can confirm that this quality profile is really reversed - which would be the simplest explanation for what's happened.

If you are happy to be a guinea-pig for the above fix, which is simply to take the quality profiles and reverse them, and pass them in as custom profiles, then you could use the following parameters --seqSys custom and --qprof1 and --qprof2 and point them to these two files (which then just get passed to ART).

MiSeqv3L250R1_reversed.txt MiSeqv3L250R2_reversed.txt

I won't make this the default behaviour until I can get confirmation about this being a real bug in ART - then I'll let you know here. But for now you can pull the latest version of the code and try again using these parameters. Hope this helps!

hoelzer commented 1 month ago

Great, thx @willboulton !

I forwarded that information to Justin, who also did the previous analysis I shared.