nickjcroucher / gubbins

Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins
http://nickjcroucher.github.io/gubbins/
GNU General Public License v2.0

mask_gubbins_aln.py not removing recombinant regions #385

Closed erinpnewcomer closed 7 months ago

erinpnewcomer commented 9 months ago

Hi!

I'm trying to create a masked core genome alignment so I can see what % of the core is getting masked by Gubbins, but the script mask_gubbins_aln.py doesn't seem to be making any changes to the input core.aln. The recombination_predictions.gff file has 1501 lines of predictions. Has anyone else encountered this/any ideas on how to fix this?

nickjcroucher commented 8 months ago

Are there any unusual characters in your isolate names (e.g. "#"?)

sylarKYG commented 6 months ago

I have the same issue as Erin. The masked alignment is the same size as the input alignment. The isolate names contain "." and "_".
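For anyone wanting to rule out a name mismatch, here is a minimal sketch that lists isolate names found in the GFF but not in the alignment. It assumes each prediction line carries a `taxa="..."` attribute holding a whitespace-separated list of names; that layout and the helper names (`alignment_names`, `gff_taxa`, `unmatched_taxa`) are my assumptions for illustration, not something taken from the script itself.

```python
import re

def alignment_names(fasta_lines):
    """Collect sequence names (first word of each FASTA header)."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])
    return names

def gff_taxa(gff_lines):
    """Collect names listed in the assumed taxa="..." attribute."""
    names = set()
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip GFF header/comment lines
        match = re.search(r'taxa="([^"]*)"', line)
        if match:
            names.update(match.group(1).split())
    return names

def unmatched_taxa(fasta_lines, gff_lines):
    """Names present in the GFF predictions but absent from the alignment."""
    return gff_taxa(gff_lines) - alignment_names(fasta_lines)

if __name__ == "__main__":
    # Toy data; real use would pass open file handles instead of lists.
    fasta = [">a.1", "ACGT", ">b_2", "ACGT"]
    gff = [
        "##gff-version 3",
        'SEQ\tGUBBINS\tCDS\t3\t4\t0.000\t.\t0\tnode="N1->a.1";taxa=" a.1 c#3";snp_count="2"',
    ]
    print(sorted(unmatched_taxa(fasta, gff)))
```

If the set comes back empty, the names agree and the problem is probably elsewhere.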

nickjcroucher commented 6 months ago

What version are you using, and what is the command you are running?

sylarKYG commented 6 months ago

Python: 3.11.7
Biopython: 1.82
The command: python3.11 /data4/CLC_data4/shiqiucheng/software/mask_gubbins_aln.py --aln clean.full.aln --gff clean.full.recombination_predictions.gff --out clean.full.mask.aln
The recombination_predictions.gff file has 1457 lines of predictions.

nickjcroucher commented 6 months ago

Thanks - what version of Gubbins?

sylarKYG commented 6 months ago

gubbins 2.4.1

nickjcroucher commented 6 months ago

Try upgrading to the latest version (3.3.2)

sylarKYG commented 6 months ago

How are the removed recombinant regions shown in the alignment file: are they replaced by "_" or "N"? I ask because the byte counts of clean.full.aln and clean.mask.aln are identical, but the counts of A, T, C and G differ.

nickjcroucher commented 6 months ago

Whatever you set the missing character to. By default it is "-"; this is documented in the script's help.
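Since masking swaps bases for the missing character in place, the file size stays the same and only the base counts change. A rough way to get the percentage masked is to compare the two alignments position by position. This is a minimal sketch using plain Python; the function names `read_fasta` and `masked_fraction` are hypothetical, and it assumes both files are FASTA with the same sequence names and lengths.

```python
def read_fasta(lines):
    """Parse FASTA lines into a {name: sequence} dictionary."""
    chunks = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {n: "".join(parts) for n, parts in chunks.items()}

def masked_fraction(original, masked, missing_char="-"):
    """Fraction of positions that became missing_char after masking."""
    total = 0
    newly_masked = 0
    for name, orig_seq in original.items():
        mask_seq = masked[name]
        if len(orig_seq) != len(mask_seq):
            raise ValueError(f"length mismatch for {name}")
        total += len(orig_seq)
        # Count only positions that were not already missing in the input.
        newly_masked += sum(
            1 for a, b in zip(orig_seq, mask_seq)
            if b == missing_char and a != missing_char
        )
    return newly_masked / total if total else 0.0

if __name__ == "__main__":
    # Toy data; real use would pass open file handles, e.g. open("clean.full.aln").
    original = read_fasta([">a.1", "ACGTACGT", ">b_2", "ACGTACGA"])
    masked = read_fasta([">a.1", "AC--ACGT", ">b_2", "ACGTACGA"])
    print(f"{100 * masked_fraction(original, masked):.1f}% masked")
```

A result of zero on real data would suggest that no regions were masked at all, rather than that they were masked with an unexpected character.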