single- vs. multi-copy genes in align module

BrigidaGallone commented 1 year ago

Dear Daniel,

After extracting the core genes, I processed the ucg files to get copy counts for each gene, in order to understand which genes are still single-copy and can be carried along for phylogenetic analyses. The counts are surprisingly variable, partly for biological and partly for technical (assembly-related) reasons, but the majority of the genes are indeed single-copy. It would be very useful to be able to extract all the unaligned sequences as FASTA files, so the user can decide what to bring to the next stage. Currently, when using the "align" module, all the genes for all the species seem to be included in one single alignment (including multi-copy genes). Is this correct?

Take the MIP2 gene as an example. I searched 54 species: in 48 species it was single-copy, in 2 species it was missing, and in 4 species it was duplicated. My MIP2 alignment includes 52 species, so I assume these are the 48 + 4, but I cannot see any label for the species with duplicates, each of which appears in there only once. Can you tell me how the genes are carried along and labeled?

| Gene | Category (0 = missing, 1 = 1 copy, 2 = 2 copies) | Nr. species | Freq |
| --- | --- | --- | --- |
| MIP1 | 0 | 2 | 0.03703704 |
| MIP1 | 1 | 48 | 0.88888889 |
| MIP1 | 2 | 4 | 0.07407407 |
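For readers who want to reproduce this kind of tally: each frequency is the number of species in a category divided by the 54 species searched. Below is a minimal sketch in Python, assuming the per-species copy counts for one gene have already been collected into a dictionary. The `copy_number_table` helper and the `spNN` species labels are hypothetical and only serve the illustration.

```python
from collections import Counter


def copy_number_table(copies_by_species):
    """Return {copy_number: (n_species, frequency)} for one gene.

    copies_by_species maps a species label to the number of copies found
    for that gene (0 = missing). This layout is an assumption made for
    the example, not a format produced by the pipeline.
    """
    total = len(copies_by_species)
    tally = Counter(copies_by_species.values())
    return {copies: (n, n / total) for copies, n in sorted(tally.items())}


# Hypothetical input matching the counts reported in this comment:
# 54 species, 2 with no copy, 48 with one copy, 4 with two copies.
per_species = [0] * 2 + [1] * 48 + [2] * 4
counts = {f"sp{i:02d}": c for i, c in enumerate(per_species)}

table = copy_number_table(counts)
for copies, (n_species, freq) in table.items():
    print(copies, n_species, f"{freq:.8f}")
```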

Thank you in advance for your help!

endixk commented 1 year ago

Dear Brigida,

The pipeline currently aligns the sequence with the highest score (lowest e-value) when multiple copies are detected. This is a legacy from the earliest build of my pipeline, which should be rectified. Thank you for pointing this out.
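The selection rule described here (keep the copy with the lowest e-value) can be sketched as follows. This is only an illustration of the rule, not the pipeline's actual code. The `best_hit` helper, the `id` labels and the `evalue` field name are assumptions.

```python
def best_hit(hits):
    """Return the hit with the lowest e-value, or None if there are no hits.

    Each hit is a dict with an "evalue" entry (assumed field name).
    """
    return min(hits, key=lambda hit: hit["evalue"], default=None)


# Two hypothetical copies of the same gene in one genome.
hits = [
    {"id": "copy_a", "evalue": 1e-50},
    {"id": "copy_b", "evalue": 1e-120},
]

chosen = best_hit(hits)
print(chosen["id"])
```

Under this rule, a species with a duplicated gene contributes exactly one sequence to the alignment, which would explain why the duplicated species appear only once and without a label.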

For now, you will have to parse the JSON-formatted .ucg file and identify/remove the genes with multiple hits. This won't be too difficult if you are familiar with any JSON parsing library (Python's json module, Java's org.json, etc.). If you are not, you can still remove the entries manually in a text editor.
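A minimal sketch of this workaround in Python is shown below. It assumes that the parsed .ucg file has a top-level "data" object mapping each gene name to a list of hits. That layout, the gene names and the helper functions are assumptions for illustration, so please check them against a real file before relying on this.

```python
import json


def split_by_copy_number(profile, data_key="data"):
    """Return (missing, single, multi) gene name lists for one profile.

    data_key names the object that maps gene name -> list of hits;
    "data" is an assumed key, adjust it to the real file layout.
    """
    genes = profile[data_key]
    missing = sorted(g for g, hits in genes.items() if len(hits) == 0)
    single = sorted(g for g, hits in genes.items() if len(hits) == 1)
    multi = sorted(g for g, hits in genes.items() if len(hits) > 1)
    return missing, single, multi


def drop_multicopy(profile, data_key="data"):
    """Return a copy of the profile without genes that have multiple hits."""
    kept = {g: hits for g, hits in profile[data_key].items() if len(hits) <= 1}
    return {**profile, data_key: kept}


# Inline stand-in for the contents of one .ucg file (hypothetical genes).
text = """
{"data": {"GENE_A": [{"evalue": 1e-80}],
          "GENE_B": [{"evalue": 1e-90}, {"evalue": 1e-40}],
          "GENE_C": []}}
"""
profile = json.loads(text)

missing, single, multi = split_by_copy_number(profile)
filtered = drop_multicopy(profile)
print(missing, single, multi)
print(sorted(filtered["data"]))
```

With a real file, `json.load` on the opened .ucg file would replace the inline string, and `json.dump` could write the filtered profile to a new file so that the original stays untouched.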

Since this makeshift measure is rather tedious, a method that automatically rejects multi-copy genes during the alignment process will be included in the next release.

Please feel free to ask any further questions you may have.

BrigidaGallone commented 1 year ago

Thanks a lot for the clarification Daniel!

endixk commented 1 year ago

@BrigidaGallone This feature has now been introduced in the newest version, v1.0.3. I would appreciate it if you could test it out!

BrigidaGallone commented 1 year ago

Hello,

Great! Thanks a lot, I will test it and keep you posted.

steineggerlab / ufcg

single- vs. multi-copy genes in align module #4