legumeinfo / pandagma

Generate pan-gene sets, given a collection of genome assemblies and corresponding gene models.
BSD 3-Clause "New" or "Revised" License

canonical.CDS for PanZeav2 #15

Closed JBerthelier closed 6 days ago

JBerthelier commented 1 week ago

Dear Pandagma author,

I am very interested in your tool and would like to reproduce your PanZeav2 data set.

Looking at the configuration files, I see that you are using .canonical.cds.fa files.

However, these files are missing for many Zea lines at https://download.maizegdb.org/; only the file with all CDS sequences is available. For instance, at https://download.maizegdb.org/Zd-Gigi-REFERENCE-PanAnd-1.0/ only Zd-Gigi-REFERENCE-PanAnd-1.0_Zd00001aa.1.cds.fa.gz is available.

Could you let me know what process you applied to get the canonical.cds.fa from the all.cds.fa?

Thank you

StevenCannon-USDA commented 1 week ago

Hi Jérémy,

Thanks for the interest and feedback.

I am tagging @ekcannon here, as she is the one who has been working with maize pangene data - and she manages data (and much of the code) at maizegdb.org. You can follow up with her if you have questions about specific files. She can tell you about pitfalls or quirks with the various annotation sets, for example.

In general though, I think those "canonical" versions were derived from the cds files using one of two utility scripts that I have just now added to pandagma (in bin/).

The script longest_variant_from_fasta.sh can derive a file of longest transcripts from a multifasta file in which the splice variants are indicated by a dot-separated final string in the cds/mRNA/protein ID -- for example, .1 or .m1
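
To illustrate the idea only (this is not the code of longest_variant_from_fasta.sh), here is a minimal Python sketch. It assumes that the gene ID is everything before the last dot in the sequence ID, and that ties are resolved by keeping the first variant seen; the function names `read_fasta` and `longest_variants` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence ID, sequence) pairs from FASTA-formatted lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The ID is the first whitespace-delimited token after ">".
            seq_id = line[1:].split()[0]
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def longest_variants(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Map each gene ID to its longest (variant ID, sequence).

    Assumption: the splice variant is the final dot-separated field of
    the ID (e.g. ".1" or ".m1"). IDs without a dot are treated as their
    own gene. Ties keep the first variant encountered.
    """
    best: Dict[str, Tuple[str, str]] = {}
    for seq_id, seq in records:
        gene_id = seq_id.rsplit(".", 1)[0] if "." in seq_id else seq_id
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (seq_id, seq)
    return best
```

With a decompressed CDS file, `longest_variants(read_fasta(open(path)))` would return one record per gene, which can then be written back out as FASTA. Note that the last-dot assumption fails if some IDs contain dots but carry no variant suffix.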

The script longest_variant_from_gff.pl can be used when the splice variants don't have a trivial relationship to the parent gene identifier, but where the parent-child relationship is given in the GFF. In this case, use longest_variant_from_gff.pl to identify a list of longest transcripts; then use the list with get_fasta_subset.pl to extract those sequences from the fasta file.
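
As a rough illustration of the GFF-based route (again, not the code of longest_variant_from_gff.pl or get_fasta_subset.pl), the Python sketch below assumes GFF3-style `ID=`/`Parent=` attributes, treats `mRNA` or `transcript` features as the children of genes, and measures transcript length as the summed length of its CDS features. The actual scripts may define length differently; all function names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set, Tuple


def parse_attributes(field: str) -> Dict[str, str]:
    """Parse a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def longest_transcripts_from_gff(gff_lines: Iterable[str]) -> Dict[str, str]:
    """Map each gene ID to the transcript ID with the greatest total CDS length."""
    transcript_gene: Dict[str, str] = {}
    cds_length: Dict[str, int] = defaultdict(int)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed lines
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if ftype in ("mRNA", "transcript") and "ID" in attrs and "Parent" in attrs:
            transcript_gene[attrs["ID"]] = attrs["Parent"]
        elif ftype == "CDS" and "Parent" in attrs:
            # A CDS feature may list several parent transcripts.
            for parent in attrs["Parent"].split(","):
                cds_length[parent] += end - start + 1  # GFF coordinates are inclusive
    best: Dict[str, Tuple[str, int]] = {}
    for tid, gene in transcript_gene.items():
        length = cds_length.get(tid, 0)
        if gene not in best or length > best[gene][1]:
            best[gene] = (tid, length)
    return {gene: tid for gene, (tid, _) in best.items()}


def subset_records(
    records: Iterable[Tuple[str, str]], keep_ids: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Yield only the (ID, sequence) pairs whose ID is in keep_ids."""
    for seq_id, seq in records:
        if seq_id in keep_ids:
            yield seq_id, seq
```

The two steps mirror the workflow described above: first derive the list of longest transcript IDs from the GFF, then pass `set(result.values())` to `subset_records` to pull those sequences out of the parsed FASTA records.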

In most cases, longest_variant_from_fasta.sh is sufficient and preferable. I think this is the script that was used for the maize files -- though I am not 100% sure.

JBerthelier commented 1 week ago

Dear Steven, thank you for your quick reply and for adding your scripts. I will try both of them, and I will contact ekcannon for more information. Best,