Annotation with user-provided CDS fasta sequences

seajane commented 7 months ago

I am trying to create a PPanGGOLiN pangenome using annotations from another source. I have gff3 files and matching fasta files. I am running PPanGGOLiN version 2.0.2. I used this command: ppanggolin annotate --anno gffdf.list --fasta eggfast.list. I received an error.

seajane commented 7 months ago

I upgraded to 2.0.4 in case that helped, and received the same error. Here it is in more detail:

Traceback (most recent call last):
  File "/Users/hbouzek/opt/anaconda3/envs/ppgg-new2/lib/python3.10/site-packages/ppanggolin/annotate/annotate.py", line 578, in get_gene_sequences_from_fastas
    gene.add_sequence(get_dna_sequence(fasta_dict[org][contig.name], gene))
  File "/Users/hbouzek/opt/anaconda3/envs/ppgg-new2/lib/python3.10/site-packages/ppanggolin/annotate/synta.py", line 306, in get_dna_sequence
    return reverse_complement(contig_seq[gene.start - 1:gene.stop])
  File "/Users/hbouzek/opt/anaconda3/envs/ppgg-new2/lib/python3.10/site-packages/ppanggolin/annotate/synta.py", line 46, in reverse_complement
    rcseq += complement[i]
KeyError: 'L'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/hbouzek/opt/anaconda3/envs/ppgg-new2/bin/ppanggolin", line 8, in <module>
    sys.exit(main())
  File "/Users/hbouzek/opt/anaconda3/envs/ppgg-new2/lib/python3.10/site-packages/ppanggolin/main.py", line 177, in main
    ppanggolin.annotate.launch(args)
  File "/Users/hbouzek/opt/anaconda3/envs/ppgg-new2/lib/python3.10/site-packages/ppanggolin/annotate/annotate.py", line 670, in launch
    get_gene_sequences_from_fastas(pangenome, args.fasta)
  File "/Users/hbouzek/opt/anaconda3/envs/ppgg-new2/lib/python3.10/site-packages/ppanggolin/annotate/annotate.py", line 586, in get_gene_sequences_from_fastas
    raise KeyError(msg)
KeyError: 'Fasta file for genome G_NR021 did not have the contig NZ_NQOP01000002.1_1 that was read from the annotation file. The provided contigs in the fasta were : NZ_NQOP01000003.1_1, NZ_NQOP01000003.1_2, NZ_NQOP01000003.1_3, NZ_NQOP01000003.1_4, NZ_NQOP01000003.1_5, NZ_NQOP01000003.1_6, NZ_NQOP01000003.1_7, NZ_NQOP01000003.1_8, NZ_NQOP01000003.1_9, NZ_NQOP01000003.1_10, NZ_NQOP01000003.1_11, NZ_NQOP01000003.1_12, NZ_NQOP01000003.1_13, NZ_NQOP01000003.1_14, NZ_NQOP01000003.1_15, NZ_NQOP01000003.1_16, NZ_NQOP01000003.1_17, NZ_NQOP01000003.1_18, NZ_NQOP01000003.1_19, NZ_NQOP01000003.1_20, NZ_NQOP01000003.1_21, NZ_NQOP01000003.1_22, NZ_NQOP01000003.1_23, NZ_NQOP01000003.1_24, NZ_NQOP01000003.1_25, NZ_NQOP01000003.1_26, NZ_NQOP01000003.1_27, NZ_NQOP01000003.1_28, NZ_NQOP01000003.1_29, NZ_NQOP01000003.1_30, NZ_NQOP01000003.1_31, NZ_NQOP01000003.1_32, NZ_NQOP01000003.1_33, NZ_NQOP01000003.1_34, NZ_NQOP01000003.1_35, NZ_NQOP01000003.1_36, NZ_NQOP01000003.1_37, NZ_NQOP01000003.1_38, NZ_NQOP01000003.1_39, NZ_NQOP01000003.1_40, NZ_NQOP01000003.1_41, NZ_NQOP01000003.1_42, NZ_NQOP01000003.1_43, NZ_NQOP01000003.1_44, NZ_NQOP01000003.1_45, NZ_NQOP01000003.1_46, NZ_NQOP01000003.1_47, NZ_NQOP01000003.1_48, NZ_NQOP01000003.1_49, NZ_NQOP01000003.1_50, NZ_NQOP01000003.1_51, NZ_NQOP01000003.1_52, NZ_NQOP01000003.1_53, NZ_NQOP01000003.1_54, NZ_NQOP01000003.1_55, NZ_NQOP01000003.1_56, NZ_NQOP01000003.1_57, NZ_NQOP01000003.1_58, NZ_NQOP01000003.1_59, NZ_NQOP01000003.1_60, NZ_NQOP01000003.1_61, NZ_NQOP01000003.1_62, NZ_NQOP01000003.1_63, NZ_NQOP01000003.1_64, NZ_NQOP01000003.1_65, NZ_NQOP01000003.1_66, NZ_NQOP01000003.1_67, NZ_NQOP01000003.1_68, NZ_NQOP01000003.1_69, NZ_NQOP01000003.1_70, NZ_NQOP01000003.1_71, NZ_NQOP01000003.1_72, NZ_NQOP01000003.1_73, NZ_NQOP01000003.1_74, NZ_NQOP01000003.1_75, NZ_NQOP01000003.1_76, NZ_NQOP01000003.1_77, NZ_NQOP01000003.1_78, NZ_NQOP01000003.1_79, NZ_NQOP01000003.1_80, NZ_NQOP01000003.1_81, NZ_NQOP01000003.1_82, NZ_NQOP01000003.1_83, 
NZ_NQOP01000003.1_84, NZ_NQOP01000002.1_1

seajane commented 7 months ago

Now that I see the whole error, I believe one problem lies in the fasta type. PPanGGOLiN is expecting DNA files (I assume, from the 'get_dna_sequence' and 'reverse_complement' functions that failed). Is there any way to use AA sequences?
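
The first KeyError: 'L' is consistent with a protein sequence reaching a nucleotide complement table. The sketch below is not PPanGGOLiN's code: COMPLEMENT, reverse_complement_sketch and looks_like_protein are hypothetical names, used only to illustrate the failure mode and a quick pre-check on a fasta record.

```python
# Hypothetical illustration only; this is not the PPanGGOLiN implementation.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def reverse_complement_sketch(seq: str) -> str:
    """Reverse-complement a DNA string with a plain dict lookup."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def looks_like_protein(seq: str) -> bool:
    """Return True if any letter is not an IUPAC nucleotide code."""
    iupac_nucleotides = set("ACGTUNRYSWKMBDHV")
    return any(ch not in iupac_nucleotides for ch in seq.upper())


print(reverse_complement_sketch("ATGC"))  # GCAT
print(looks_like_protein("MKLLV"))        # True ('L' is not a nucleotide code)

try:
    # Amino acid letters have no entry in the complement table.
    reverse_complement_sketch("MKL")
except KeyError as err:
    print("KeyError:", err)
```

The pre-check is only a heuristic: a short peptide made entirely of letters that are also IUPAC nucleotide codes would pass it.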

seajane commented 7 months ago

I converted all my AA sequences to nucleotide sequences, trying both degenerate bases and a random trinucleotide (codon) for each amino acid. The final error is the same as above, but the initial error has changed to:

Traceback (most recent call last):
  File "/Users/hbouzek/opt/anaconda3/envs/ppgg-new2/lib/python3.10/site-packages/ppanggolin/annotate/annotate.py", line 578, in get_gene_sequences_from_fastas
    gene.add_sequence(get_dna_sequence(fasta_dict[org][contig.name], gene))
KeyError: 'NZ_NQOP01000002.1_1'
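
One way to narrow this kind of error down is to compare the sequence IDs in column 1 of the gff3 file with the record names in the matching fasta file. The sketch below is a minimal, standalone check under the assumption that fasta record names are the first whitespace-delimited token of each header line. The function names are hypothetical and nothing here calls PPanGGOLiN.

```python
def fasta_ids(lines):
    """Collect record names (first token after '>') from fasta lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                ids.add(header.split()[0])
    return ids


def gff_seqids(lines):
    """Collect column-1 sequence IDs from gff3 lines, skipping comments."""
    ids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; stop reading features
        if line.startswith("#") or not line.strip():
            continue
        ids.add(line.split("\t", 1)[0])
    return ids


def missing_contigs(gff_lines, fasta_lines):
    """Return gff3 sequence IDs that have no record in the fasta."""
    return sorted(gff_seqids(gff_lines) - fasta_ids(fasta_lines))


# Small in-memory example with made-up identifiers.
gff = [
    "##gff-version 3\n",
    "contig_1\tsrc\tCDS\t1\t9\t.\t+\t0\tID=gene_1\n",
    "contig_2\tsrc\tCDS\t4\t12\t.\t-\t0\tID=gene_2\n",
]
fasta = [">contig_1 some description\n", "ATGAAATAA\n", ">contig_1_1\n", "ATGCCCTAA\n"]
print(missing_contigs(gff, fasta))  # ['contig_2']
```

Both helpers accept any iterable of lines, so open file handles for a gff3 file and its matching fasta file can be passed in directly.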

axbazin commented 7 months ago

Hi,

I'm not quite sure what is happening in the code, but PPanGGOLiN deals with genomic DNA fasta files and nothing else. It does not expect to work with AA or CDS fasta files.

For what you want to do, if you really want to use PPanGGOLiN, I'd recommend providing your own clustering, along with the gff3 files that you are already using, via this option: https://ppanggolin.readthedocs.io/en/latest/user/PangenomeAnalyses/pangenomeAnalyses.html#providing-your-gene-families

If you are looking for a clustering tool, PPanGGOLiN uses MMseqs2 internally, so you could try that.
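
As a rough sketch of that route: the MMseqs2 easy-cluster workflow writes a two-column TSV (cluster representative, then member). Assuming the gene-family file described at the link above is a tab-separated table with a family identifier followed by a gene identifier, and assuming the member IDs match the gene IDs in the gff3 files, the MMseqs2 table could be regrouped as follows. The function names and the fam_ prefix are hypothetical, and the expected format should be checked against the documentation before relying on this.

```python
from collections import defaultdict


def group_clusters(lines):
    """Group 'representative<TAB>member' lines into {representative: [members]}."""
    families = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        representative, member = line.split("\t")[:2]
        families[representative].append(member)
    return dict(families)


def to_family_table(families, prefix="fam_"):
    """Emit 'family_id<TAB>gene_id' rows, numbering families in sorted order."""
    rows = []
    for index, (_, members) in enumerate(sorted(families.items()), start=1):
        for member in members:
            rows.append(f"{prefix}{index}\t{member}")
    return rows


# Made-up identifiers standing in for an MMseqs2 cluster table.
demo = ["geneA\tgeneA\n", "geneA\tgeneB\n", "geneC\tgeneC\n"]
for row in to_family_table(group_clusters(demo)):
    print(row)
```

Renaming families is optional here; the representative ID could be kept as the family identifier instead, which makes it easier to trace a family back to the clustering output.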

I agree that in your case, where gff3 files are provided, things could in theory work with AA or CDS fasta sequences, but we have not gone down this path.

Adelme

labgem / PPanGGOLiN

Annotation with user-provided CDS fasta sequences #197