lepbase / easy-import

Code to simplify import of data into Ensembl databases
MIT License
2 stars 3 forks source link

Repairing gff #1

Open pachiras opened 4 years ago

pachiras commented 4 years ago

First of all, let me say thank you for sharing the great project!

But I'm having a hard time importing a new genome. I found my gff needs to be modified before the import. So I added some lines in the ini file.

[GFF]
[FILES]
         SCAFFOLD = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genome.fa.gz ]
         GFF = [ gff ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz ]
         ; /import/src/modify_gff3.sh FES_r1.0.genes.gff3 > temp.gff3
         ; mv temp.gff3 FES_r1.0.genes.gff3
         PROTEIN = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.pep.fa.gz ]

Here, the script for gff modification (/import/src/modify_gff3.sh) was as follows:

#!/usr/bin/env bash

cat $1 | awk -F'\t' 'BEGIN {
    OFS = "\t"
}
{
    if ($3 == "gene") {
        print # print gene

        split($9, GWORDS, ";")
        PARENT = GWORDS[1]
        gsub("ID=", "Parent=", PARENT)
        gsub("gene", "transcript", $9)

        $3 = "mRNA"
        $9 = $9";"PARENT
    } else {
        gsub("gene", "transcript", $9);
    }

    print
}

I checked the modified gff by this script was imported correctly.

The script was shared with the container by the docker command option:

docker run --rm -u $UID:$GROUPS --name easy-import-fagopyrum_esculentum_fes1_core_1_1 --network docker_genomehubs-network -v /data/nas1.1/genomehubs/v1/import/conf:/import/conf  -v /data/nas1.1/genomehubs/v1/import/data:/
import/data -v /data/nas1.1/genomehubs/v1/import/src:/import/src -e DATABASE=fagopyrum_esculentum_fes1_core_1_1 -e FLAGS="-s -p" genomehubs/easy-import:19.05

But the command output many warnings and the script didn't seem to be working. Am I doing wrong?

Any comment would be appreciated. Thank you.

rjchallis commented 4 years ago

I think I can see what the problem is here. The script that you want to run on the input GFF won't be run by the import container. The lines:

         ; /import/src/modify_gff3.sh FES_r1.0.genes.gff3 > temp.gff3
         ; mv temp.gff3 FES_r1.0.genes.gff3

are treated as comments. Where I've used this in examples, it is meant to link any notes on steps that had to be run manually to the configuration file, but there is no code in the container to actually run such commands.

If you run your script on the downloaded file (which will be in a subdirectory of /data/nas1.1/genomehubs/v1/import/data) then rerun the import, it should skip the download (as the file already exists locally) and use your modified file. You will also need to be sure to delete any other files with this filename as a prefix (e.g. *.sorted.gff) to make sure it resumes the import process from the new file and not a processed version of the old file.

If you are still having problems could you share a small portion of your GFF and I can try to see what else will help. It looks like the changes you are making in your script could probably be handled by options to the [GFF] configuration, but I realise that this doesn't use the most intuitive syntax.

I should also note that the import will usually generate quite a few warnings about the sequence names not being valid accessions that can be ignored. If you could show me any other error messages that you get then that will help with working out how best to get the file imported.

pachiras commented 4 years ago

Hi @rjchallis,

Thank you for your detailed comment. It was my misunderstanding. I thought the command after the semi colons are executed in the process.

I still want some kind of preprocessing mechanism in the import process since:

So I have been testing some experimental code in https://github.com/pachiras/easy-import/tree/preprocessing

Firstly I prepared the ini code:

[DATABASE_CORE]
        NAME = fagopyrum_esculentum_fes1_core_1_1
[META]
        SPECIES.PRODUCTION_NAME = fagopyrum_esculentum_fes1
        SPECIES.SCIENTIFIC_NAME = Fagopyrum esculentum
        SPECIES.COMMON_NAME = buckwheat
        SPECIES.DISPLAY_NAME = Genus species
        SPECIES.DIVISION = EnsemblMetazoa
        SPECIES.URL = Genus_species_asm
        SPECIES.TAXONOMY_ID = 1
        SPECIES.ALIAS = [ ]                   
        ASSEMBLY.NAME = Assembly.name
        ASSEMBLY.DATE = 2018-30-10
        ASSEMBLY.ACCESSION = 
        ASSEMBLY.DEFAULT = Assembl.name
        PROVIDER.NAME = Provider name
        PROVIDER.URL = http://example.com      
        GENEBUILD.ID = 1
        GENEBUILD.START_DATE = 2018-10
        GENEBUILD.VERSION = 1
        GENEBUILD.METHOD = import
[GFF]
[FILES]
        SCAFFOLD = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genome.fa.gz ]
        GFF = [ gff ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz ]
        PROTEIN = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.pep.fa.gz ]
[PREPROCESSING]
        COMMAND1 = /import/src/fagopyrum_esculentum_fes1_core_1_1/modify_gff3.sh FES_r1.0.genes.gff3 > temp.gff3
        COMMAND2 = mv temp.gff3 FES_r1.0.genes.gff3
[GENE_STABLE_IDS]
        GFF = [ gene->ID /(.+)/ /gene:// ]
[TRANSCRIPT_STABLE_IDS]
        GFF = [ SELF->ID /(.+)/ /transcript:// ]
[TRANSLATION_STABLE_IDS]
        GFF = [ SELF->ID /(.+)/ /transcript:// ]
[MODIFY]
        OVERWRITE_DB = 1
        TRUNCATE_SEQUENCE_TABLES = 1
        TRUNCATE_GENE_TABLES = 1
        INVERT_PHASE = 1
        AUTO_PHASE = 1

Here, I added PREPROCESSING section.

And then I modified core/prepare_gff.pl to execute the commands:

## preprocess the download/obtain files using the commands in the ini file
foreach my $subsection (sort keys %{$params->{'PREPROCESSING'}}){
    preprocess_files($params->{'PREPROCESSING'}{$subsection});
}

The function preprocess_files was defined in modules/EasyImport/Core.pm

sub preprocess_files {
    my ($command);
    $command = shift;
    system "$command";
}

Cloning those modified files into a docker container, I tested the code from the inside of container:

eguser@cf5a689d91ee:/import$ perl /ensembl/easy-import/core/prepare_gff.pl /import/conf/default.ini /import/conf/overwrite.ini /import/
conf/fagopyrum_esculentum_fes1_core_1_1.ini
--2020-02-04 00:57:28--  ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz
=> 'FES_r1.0.genes.gff3.gz'
Resolving ftp.kazusa.or.jp (ftp.kazusa.or.jp)... 61.117.144.201
Connecting to ftp.kazusa.or.jp (ftp.kazusa.or.jp)|61.117.144.201|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/buckwheat ... done.
==> SIZE FES_r1.0.genes.gff3.gz ... 25416389
==> PASV ... done.    ==> RETR FES_r1.0.genes.gff3.gz ... done.
Length: 25416389 (24M) (unauthoritative)

FES_r1.0.genes.gff3.gz            100%[============================================================>]  24.24M  7.95MB/s    in 3.0s

2020-02-04 00:57:31 (7.95 MB/s) - 'FES_r1.0.genes.gff3.gz' saved [25416389]

--2020-02-04 00:57:32--  ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.pep.fa.gz
=> 'FES_r1.0.pep.fa.gz'
Resolving ftp.kazusa.or.jp (ftp.kazusa.or.jp)... 61.117.144.201
Connecting to ftp.kazusa.or.jp (ftp.kazusa.or.jp)|61.117.144.201|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/buckwheat ... done.
==> SIZE FES_r1.0.pep.fa.gz ... 41372522
==> PASV ... done.    ==> RETR FES_r1.0.pep.fa.gz ... done.
Length: 41372522 (39M) (unauthoritative)

FES_r1.0.pep.fa.gz                100%[============================================================>]  39.46M  6.90MB/s    in 5.7s

2020-02-04 00:57:39 (6.88 MB/s) - 'FES_r1.0.pep.fa.gz' saved [41372522]

--2020-02-04 00:57:40--  ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genome.fa.gz
=> 'FES_r1.0.genome.fa.gz'
Resolving ftp.kazusa.or.jp (ftp.kazusa.or.jp)... 61.117.144.201
Connecting to ftp.kazusa.or.jp (ftp.kazusa.or.jp)|61.117.144.201|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/buckwheat ... done.
==> SIZE FES_r1.0.genome.fa.gz ... 278644757
==> PASV ... done.    ==> RETR FES_r1.0.genome.fa.gz ... done.
Length: 278644757 (266M) (unauthoritative)

FES_r1.0.genome.fa.gz             100%[============================================================>] 265.74M  5.86MB/s    in 31s

2020-02-04 00:58:12 (8.45 MB/s) - 'FES_r1.0.genome.fa.gz' saved [278644757]

WARNING: Fes_sc0000543.1.g000011.gia.1 cds coordinates (44877, 45821) do not match provided protein
WARNING: Fes_sc0001485.1.g000007.gia.1 cds coordinates (29528, 29922) do not match provided protein
WARNING: Fes_sc0003521.1.g000002.gia.1 cds coordinates (3932, 4261) do not match provided protein
WARNING: Fes_sc0005272.1.g000006.gia.1 cds coordinates (24276, 24325) do not match provided protein
WARNING: Fes_sc0005539.1.g000005.gia.1 cds coordinates (22002, 22410) do not match provided protein
WARNING: Fes_sc0007369.1.g000002.gia.1 cds coordinates (12940, 13258) do not match provided protein
WARNING: Fes_sc0007394.1.g000006.gia.1 cds coordinates (29857, 30006) do not match provided protein
WARNING: Fes_sc0009818.1.g000002.gia.1 cds coordinates (9581, 9925) do not match provided protein
WARNING: Fes_sc0011505.1.g000003.gia.1 cds coordinates (12216, 12641) do not match provided protein
WARNING: Fes_sc0012848.1.g000001.gia.1 cds coordinates (152, 350) do not match provided protein
WARNING: Fes_sc0015416.1.g000003.gia.1 cds coordinates (9064, 9297) do not match provided protein
WARNING: Fes_sc0016234.1.g000003.gia.1 cds coordinates (17194, 17601) do not match provided protein
WARNING: Fes_sc0016393.1.g000001.gia.1 cds coordinates (3544, 4007) do not match provided protein
WARNING: Fes_sc0021518.1.g000002.gia.1 cds coordinates (5722, 5948) do not match provided protein

eguser@cf5a689d91ee:/import$ perl /ensembl/easy-import/core/import_gene_models.pl /import/conf/default.ini /import/conf/overwrite.ini /import/conf/fagopyrum_esculentum_fes1_core_1_1.ini
286768 genes in database
286768 transcripts in database
286768 translations in database
917376 exons in database

Although it gave 14 WARNINGs, I could say it was successful.

As for GFF itself (ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz), the file doesn't contain any mRNAs which I need for our study. My modification script modify_gff3.sh inserts mRNAs with the same size of its genes creating proper Parent-Child relations.

I tried some techniques equipped in easy-import, but I failed to get my desired result.

I would appreciate if you could give me any comment about the preprocessing.

rjchallis commented 4 years ago

Thanks for putting int the effort in to make preprocessing! I really like the idea and this looks like a very flexible implementation. I had anticipated including one-liners as comments but having them run like this is a good step towards greater reproducibility. I be happy to take a pull request for this.

One thing that would be good to resolve is that as you are referencing a script, the commands in modify_gff3.sh would be unknown to someone trying to recreate this from the ini file alone. Maybe just having the script as a Github gist with a note of its URL would be a good idea.

Only 14 warnings is good, presumably these are stop codons or non-ACGT bases that are just one character out in the file versus the database.

For the particular problem of missing mRNAs, the line:

_CONDITION_9 = [ EXPECTATION exon     hasParent transcript|mrna|mirna|trna|ncrna|rrna force ]

in the default.ini file should be creating parent transcripts. I'm not able to test this at the moment, but I wonder if you were having problems as any new rule you tried were being applied after this. If so, you should be able to create a _CONDITION_9 in your ini to overwrite this default that has mrna first (or only mrna) then the GFF processing should be able to fill in the missing mRNAs:

_CONDITION_9 = [ EXPECTATION exon     hasParent mrna force ]
pachiras commented 4 years ago

Hi @rjchallis,

Sorry, for the late reply. I am reporting the result of your advice.

This is the snippet of the original gff:

Fes_sc0000001.1 augustus_arabi  gene    560 4505    0.05    +   .   ID=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    560 645 .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.1;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 589 645 0.39    +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.1;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 727 959 1   +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.2;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    727 959 .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.2;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 1078    1326    0.72    +   1   ID=CDS:Fes_sc0000001.1.g000001.aua.1.3;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    1078    1326    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.3;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 1414    1558    0.98    +   1   ID=CDS:Fes_sc0000001.1.g000001.aua.1.4;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    1414    1558    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.4;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 1692    1793    1   +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.5;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    1692    1793    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.5;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 2046    2138    0.91    +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.6;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    2046    2138    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.6;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 2211    2252    0.96    +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.7;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    2211    2252    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.7;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 2379    2489    0.8 +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.8;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    2379    2489    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.8;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 2605    2676    0.99    +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.9;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    2605    2676    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.9;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 3172    3354    1   +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.10;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    3172    3354    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.10;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 3856    4212    1   +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.11;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    3856    4212    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.11;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 4295    4405    1   +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.12;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    4295    4505    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.12;Parent=gene:Fes_sc0000001.1.g000001.aua.1

There is no mRNA in the file. And this is the output of my modification script (modify_gff3.sh):

Fes_sc0000001.1 augustus_arabi  gene    560 4505    0.05    +   .   ID=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  mRNA    560 4505    0.05    +   .   ID=transcript:Fes_sc0000001.1.g000001.aua.1;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    560 645 .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.1;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 589 645 0.39    +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.1;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 727 959 1   +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.2;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    727 959 .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.2;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 1078    1326    0.72    +   1   ID=CDS:Fes_sc0000001.1.g000001.aua.1.3;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    1078    1326    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.3;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 1414    1558    0.98    +   1   ID=CDS:Fes_sc0000001.1.g000001.aua.1.4;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    1414    1558    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.4;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 1692    1793    1   +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.5;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    1692    1793    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.5;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 2046    2138    0.91    +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.6;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    2046    2138    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.6;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 2211    2252    0.96    +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.7;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    2211    2252    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.7;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 2379    2489    0.8 +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.8;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    2379    2489    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.8;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 2605    2676    0.99    +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.9;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    2605    2676    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.9;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 3172    3354    1   +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.10;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    3172    3354    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.10;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 3856    4212    1   +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.11;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    3856    4212    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.11;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  CDS 4295    4405    1   +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.12;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi  exon    4295    4505    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.12;Parent=transcript:Fes_sc0000001.1.g000001.aua.1

Here, a mRNA with the same size of gene was inserted and CDSs and eons belonged to it.

Finally, the result after setting the condition

_CONDITION_9 = [ EXPECTATION exon     hasParent mrna force ]

in the ini file (fagopyrum_esculentum_fes1_core_1_1.ini):

Fes_sc0000001.1 augustus_arabi  gene    560 4505    0.05    +   .   ID=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 GFFTree mrna    1078    1326    .   +   .   ID=GFFTree_mrna670;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna670;translation_stable_id=GFFTree_mrna670
Fes_sc0000001.1 augustus_arabi  CDS 1078    1326    0.72    +   1   ID=CDS:Fes_sc0000001.1.g000001.aua.1.3;Parent=GFFTree_mrna670
Fes_sc0000001.1 augustus_arabi  exon    1078    1326    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.3;Parent=GFFTree_mrna670
Fes_sc0000001.1 GFFTree mrna    1414    1558    .   +   .   ID=GFFTree_mrna671;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna671;translation_stable_id=GFFTree_mrna671
Fes_sc0000001.1 augustus_arabi  CDS 1414    1558    0.98    +   1   ID=CDS:Fes_sc0000001.1.g000001.aua.1.4;Parent=GFFTree_mrna671
Fes_sc0000001.1 augustus_arabi  exon    1414    1558    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.4;Parent=GFFTree_mrna671
Fes_sc0000001.1 GFFTree mrna    1692    1793    .   +   .   ID=GFFTree_mrna672;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna672;translation_stable_id=GFFTree_mrna672
Fes_sc0000001.1 augustus_arabi  CDS 1692    1793    1   +   2   ID=CDS:Fes_sc0000001.1.g000001.aua.1.5;Parent=GFFTree_mrna672
Fes_sc0000001.1 augustus_arabi  exon    1692    1793    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.5;Parent=GFFTree_mrna672
Fes_sc0000001.1 GFFTree mrna    2046    2138    .   +   .   ID=GFFTree_mrna673;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna673;translation_stable_id=GFFTree_mrna673
Fes_sc0000001.1 augustus_arabi  CDS 2046    2138    0.91    +   2   ID=CDS:Fes_sc0000001.1.g000001.aua.1.6;Parent=GFFTree_mrna673
Fes_sc0000001.1 augustus_arabi  exon    2046    2138    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.6;Parent=GFFTree_mrna673
Fes_sc0000001.1 GFFTree mrna    2211    2252    .   +   .   ID=GFFTree_mrna674;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna674;translation_stable_id=GFFTree_mrna674
Fes_sc0000001.1 augustus_arabi  CDS 2211    2252    0.96    +   1   ID=CDS:Fes_sc0000001.1.g000001.aua.1.7;Parent=GFFTree_mrna674
Fes_sc0000001.1 augustus_arabi  exon    2211    2252    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.7;Parent=GFFTree_mrna674
Fes_sc0000001.1 GFFTree mrna    2379    2489    .   +   .   ID=GFFTree_mrna675;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna675;translation_stable_id=GFFTree_mrna675
Fes_sc0000001.1 augustus_arabi  CDS 2379    2489    0.8 +   1   ID=CDS:Fes_sc0000001.1.g000001.aua.1.8;Parent=GFFTree_mrna675
Fes_sc0000001.1 augustus_arabi  exon    2379    2489    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.8;Parent=GFFTree_mrna675
Fes_sc0000001.1 GFFTree mrna    2605    2676    .   +   .   ID=GFFTree_mrna676;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna676;translation_stable_id=GFFTree_mrna676
Fes_sc0000001.1 augustus_arabi  CDS 2605    2676    0.99    +   1   ID=CDS:Fes_sc0000001.1.g000001.aua.1.9;Parent=GFFTree_mrna676
Fes_sc0000001.1 augustus_arabi  exon    2605    2676    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.9;Parent=GFFTree_mrna676
Fes_sc0000001.1 GFFTree mrna    3172    3354    .   +   .   ID=GFFTree_mrna677;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna677;translation_stable_id=GFFTree_mrna677
Fes_sc0000001.1 augustus_arabi  CDS 3172    3354    1   +   2   ID=CDS:Fes_sc0000001.1.g000001.aua.1.10;Parent=GFFTree_mrna677
Fes_sc0000001.1 augustus_arabi  exon    3172    3354    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.10;Parent=GFFTree_mrna677
Fes_sc0000001.1 GFFTree mrna    3856    4212    .   +   .   ID=GFFTree_mrna678;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna678;translation_stable_id=GFFTree_mrna678
Fes_sc0000001.1 augustus_arabi  CDS 3856    4212    1   +   2   ID=CDS:Fes_sc0000001.1.g000001.aua.1.11;Parent=GFFTree_mrna678
Fes_sc0000001.1 augustus_arabi  exon    3856    4212    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.11;Parent=GFFTree_mrna678
Fes_sc0000001.1 GFFTree mrna    4295    4405    .   +   .   ID=GFFTree_mrna679;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna679;translation_stable_id=GFFTree_mrna679
Fes_sc0000001.1 augustus_arabi  CDS 4296    4405    1   +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.12;Parent=GFFTree_mrna679
Fes_sc0000001.1 GFFTree mrna    589 645 .   +   .   ID=GFFTree_mrna680;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna680;translation_stable_id=GFFTree_mrna680
Fes_sc0000001.1 augustus_arabi  CDS 590 645 0.39    +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.1;Parent=GFFTree_mrna680
Fes_sc0000001.1 GFFTree mrna    727 959 .   +   .   ID=GFFTree_mrna681;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna681;translation_stable_id=GFFTree_mrna681
Fes_sc0000001.1 augustus_arabi  CDS 727 959 1   +   0   ID=CDS:Fes_sc0000001.1.g000001.aua.1.2;Parent=GFFTree_mrna681
Fes_sc0000001.1 augustus_arabi  exon    727 959 .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.2;Parent=GFFTree_mrna681
Fes_sc0000001.1 GFFTree mrna    4295    4505    .   +   .   ID=GFFTree_mrna682;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna682;translation_stable_id=GFFTree_mrna682
Fes_sc0000001.1 augustus_arabi  exon    4295    4505    .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.12;Parent=GFFTree_mrna682
Fes_sc0000001.1 GFFTree mrna    560 645 .   +   .   ID=GFFTree_mrna683;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna683;translation_stable_id=GFFTree_mrna683
Fes_sc0000001.1 augustus_arabi  exon    560 645 .   +   .   ID=exon:Fes_sc0000001.1.g000001.aua.1.1;Parent=GFFTree_mrna683
###

Here, mRNA was created for each exon, which creates a lot of WARNINGS in the validation process. Removing _CONDITION_9 and using only default.ini gave, basically, the same result. So, I will go with my script for a while. Sorry but thank you for your advice.


As for the preprocessing mechanism, I am glad you like it. I have created a pull request.

But before merging the code, we have to consider one thing. I have inserted the preprocessing code only in easy-import/core/prepare_gff.pl. So some user could skip the preparation process and import the data directly. Or some user could try to execute the script multiple times, which destroy the gff.

Do you have any idea about about avoiding those happen?

rjchallis commented 4 years ago

Thanks for the detailed report, sorry the built in methods didn't work but good that it can be solved by preprocessing.

I don't think the first you raised will be too much of a problem in practice. Skipping the prepare step also means that the built in methods would not be applied so the preprocessing step is no different here. If the GFF needs fixing it should generate enough warnings/errors to indicate something is wrong.

As for running the script multiple times, this could be a problem and I'm not sure of the best solution. I like that preprocessing is generic so it could be used to update sequence file headers as well/instead of gff, but that complicates finding a way to test if the file has been changed. One idea would be to calculate checksums for each GFF/FASTA infile before preprocessing and write them to disk after preprocessing:

Of course having just typed the checksum version I'm realising that this could be simplified to just touch a file to flag that preprocessing has been done.

pachiras commented 4 years ago

I like the idea of checksum. In that case, we'd better to change the format of ini, like,

[FILES]
        SCAFFOLD = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genome.fa.gz ]
        GFF = [ gff ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz ]
        PROTEIN = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.pep.fa.gz ]
[PREPROCESSING]
        GFF = [ /import/src/fagopyrum_esculentum_fes1_core_1_1/modify_gff3.sh FES_r1.0.genes.gff3 > temp.gff3; mv temp.gff3 FES_r1.0.genes.gff3 ]
        SCAFFOLD = [ ...(Scaffold modification commands)... ]

Since this format clarifies the target file for each preprocess, it is easier for the program to determine which checksum should be checked before executing the command.

Or you could represent the target file by {}

[FILES]
        SCAFFOLD = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genome.fa.gz ]
        GFF = [ gff ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz ]
        PROTEIN = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.pep.fa.gz ]
[PREPROCESSING]
        GFF = [ /import/src/fagopyrum_esculentum_fes1_core_1_1/modify_gff3.sh {} > temp.gff3; mv temp.gff3 {} ]
        SCAFFOLD = [ ...(Scaffold modification commands)... ]
rjchallis commented 4 years ago

Thanks, reformatting the ini makes sense to associate files with appropriate checksums.

I prefer your second suggestion using {} to represent the file.

pachiras commented 4 years ago

OK. I'll think about the implementation and will renew the pull request. But it could take a few weeks because I'm kind of busy now...