Open pachiras opened 4 years ago
I think I can see what the problem is here. The script that you want to run on the input GFF won't be run by the import container. The lines:
; /import/src/modify_gff3.sh FES_r1.0.genes.gff3 > temp.gff3
; mv temp.gff3 FES_r1.0.genes.gff3
are treated as comments. Where I've used this in examples, it is meant to link any notes on steps that had to be run manually to the configuration file, but there is no code in the container to actually run such commands.
If you run your script on the downloaded file (which will be in a subdirectory of /data/nas1.1/genomehubs/v1/import/data
) then rerun the import, it should skip the download (as the file already exists locally) and use your modified file. You will also need to be sure to delete any other files with this filename as a prefix (e.g. *.sorted.gff) to make sure it resumes the import process from the new file and not a processed version of the old file.
If you are still having problems could you share a small portion of your GFF and I can try to see what else will help. It looks like the changes you are making in your script could probably be handled by options to the [GFF]
configuration, but I realise that this doesn't use the most intuitive syntax.
I should also note that the import will usually generate quite a few warnings about the sequence names not being valid accessions that can be ignored. If you could show me any other error messages that you get then that will help with working out how best to get the file imported.
Hi @rjchallis,
Thank you for your detailed comment. It was my misunderstanding. I thought the command after the semi colons are executed in the process.
I still want some kind of preprocessing mechanism in the import process since:
So I have been testing some experimental code in https://github.com/pachiras/easy-import/tree/preprocessing
Firstly I prepared the ini code:
[DATABASE_CORE]
NAME = fagopyrum_esculentum_fes1_core_1_1
[META]
SPECIES.PRODUCTION_NAME = fagopyrum_esculentum_fes1
SPECIES.SCIENTIFIC_NAME = Fagopyrum esculentum
SPECIES.COMMON_NAME = buckwheat
SPECIES.DISPLAY_NAME = Genus species
SPECIES.DIVISION = EnsemblMetazoa
SPECIES.URL = Genus_species_asm
SPECIES.TAXONOMY_ID = 1
SPECIES.ALIAS = [ ]
ASSEMBLY.NAME = Assembly.name
ASSEMBLY.DATE = 2018-30-10
ASSEMBLY.ACCESSION =
ASSEMBLY.DEFAULT = Assembl.name
PROVIDER.NAME = Provider name
PROVIDER.URL = http://example.com
GENEBUILD.ID = 1
GENEBUILD.START_DATE = 2018-10
GENEBUILD.VERSION = 1
GENEBUILD.METHOD = import
[GFF]
[FILES]
SCAFFOLD = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genome.fa.gz ]
GFF = [ gff ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz ]
PROTEIN = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.pep.fa.gz ]
[PREPROCESSING]
COMMAND1 = /import/src/fagopyrum_esculentum_fes1_core_1_1/modify_gff3.sh FES_r1.0.genes.gff3 > temp.gff3
COMMAND2 = mv temp.gff3 FES_r1.0.genes.gff3
[GENE_STABLE_IDS]
GFF = [ gene->ID /(.+)/ /gene:// ]
[TRANSCRIPT_STABLE_IDS]
GFF = [ SELF->ID /(.+)/ /transcript:// ]
[TRANSLATION_STABLE_IDS]
GFF = [ SELF->ID /(.+)/ /transcript:// ]
[MODIFY]
OVERWRITE_DB = 1
TRUNCATE_SEQUENCE_TABLES = 1
TRUNCATE_GENE_TABLES = 1
INVERT_PHASE = 1
AUTO_PHASE = 1
Here, I added PREPROCESSING section.
And then I modified core/prepare_gff.pl
to execute the commands:
## preprocess the download/obtain files using the commands in the ini file
foreach my $subsection (sort keys %{$params->{'PREPROCESSING'}}){
preprocess_files($params->{'PREPROCESSING'}{$subsection});
}
The function preprocess_files
was defined in modules/EasyImport/Core.pm
sub preprocess_files {
my ($command);
$command = shift;
system "$command";
}
Cloning those modified files into a docker container, I tested the code from the inside of container:
eguser@cf5a689d91ee:/import$ perl /ensembl/easy-import/core/prepare_gff.pl /import/conf/default.ini /import/conf/overwrite.ini /import/
conf/fagopyrum_esculentum_fes1_core_1_1.ini
--2020-02-04 00:57:28-- ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz
=> 'FES_r1.0.genes.gff3.gz'
Resolving ftp.kazusa.or.jp (ftp.kazusa.or.jp)... 61.117.144.201
Connecting to ftp.kazusa.or.jp (ftp.kazusa.or.jp)|61.117.144.201|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/buckwheat ... done.
==> SIZE FES_r1.0.genes.gff3.gz ... 25416389
==> PASV ... done. ==> RETR FES_r1.0.genes.gff3.gz ... done.
Length: 25416389 (24M) (unauthoritative)
FES_r1.0.genes.gff3.gz 100%[============================================================>] 24.24M 7.95MB/s in 3.0s
2020-02-04 00:57:31 (7.95 MB/s) - 'FES_r1.0.genes.gff3.gz' saved [25416389]
--2020-02-04 00:57:32-- ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.pep.fa.gz
=> 'FES_r1.0.pep.fa.gz'
Resolving ftp.kazusa.or.jp (ftp.kazusa.or.jp)... 61.117.144.201
Connecting to ftp.kazusa.or.jp (ftp.kazusa.or.jp)|61.117.144.201|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/buckwheat ... done.
==> SIZE FES_r1.0.pep.fa.gz ... 41372522
==> PASV ... done. ==> RETR FES_r1.0.pep.fa.gz ... done.
Length: 41372522 (39M) (unauthoritative)
FES_r1.0.pep.fa.gz 100%[============================================================>] 39.46M 6.90MB/s in 5.7s
2020-02-04 00:57:39 (6.88 MB/s) - 'FES_r1.0.pep.fa.gz' saved [41372522]
--2020-02-04 00:57:40-- ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genome.fa.gz
=> 'FES_r1.0.genome.fa.gz'
Resolving ftp.kazusa.or.jp (ftp.kazusa.or.jp)... 61.117.144.201
Connecting to ftp.kazusa.or.jp (ftp.kazusa.or.jp)|61.117.144.201|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/buckwheat ... done.
==> SIZE FES_r1.0.genome.fa.gz ... 278644757
==> PASV ... done. ==> RETR FES_r1.0.genome.fa.gz ... done.
Length: 278644757 (266M) (unauthoritative)
FES_r1.0.genome.fa.gz 100%[============================================================>] 265.74M 5.86MB/s in 31s
2020-02-04 00:58:12 (8.45 MB/s) - 'FES_r1.0.genome.fa.gz' saved [278644757]
WARNING: Fes_sc0000543.1.g000011.gia.1 cds coordinates (44877, 45821) do not match provided protein
WARNING: Fes_sc0001485.1.g000007.gia.1 cds coordinates (29528, 29922) do not match provided protein
WARNING: Fes_sc0003521.1.g000002.gia.1 cds coordinates (3932, 4261) do not match provided protein
WARNING: Fes_sc0005272.1.g000006.gia.1 cds coordinates (24276, 24325) do not match provided protein
WARNING: Fes_sc0005539.1.g000005.gia.1 cds coordinates (22002, 22410) do not match provided protein
WARNING: Fes_sc0007369.1.g000002.gia.1 cds coordinates (12940, 13258) do not match provided protein
WARNING: Fes_sc0007394.1.g000006.gia.1 cds coordinates (29857, 30006) do not match provided protein
WARNING: Fes_sc0009818.1.g000002.gia.1 cds coordinates (9581, 9925) do not match provided protein
WARNING: Fes_sc0011505.1.g000003.gia.1 cds coordinates (12216, 12641) do not match provided protein
WARNING: Fes_sc0012848.1.g000001.gia.1 cds coordinates (152, 350) do not match provided protein
WARNING: Fes_sc0015416.1.g000003.gia.1 cds coordinates (9064, 9297) do not match provided protein
WARNING: Fes_sc0016234.1.g000003.gia.1 cds coordinates (17194, 17601) do not match provided protein
WARNING: Fes_sc0016393.1.g000001.gia.1 cds coordinates (3544, 4007) do not match provided protein
WARNING: Fes_sc0021518.1.g000002.gia.1 cds coordinates (5722, 5948) do not match provided protein
eguser@cf5a689d91ee:/import$ perl /ensembl/easy-import/core/import_gene_models.pl /import/conf/default.ini /import/conf/overwrite.ini /import/conf/fagopyrum_esculentum_fes1_core_1_1.ini
286768 genes in database
286768 transcripts in database
286768 translations in database
917376 exons in database
Although it gave 14 WARNINGs, I could say it was successful.
As for GFF itself (ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz
), the file doesn't contain any mRNAs which I need for our study.
My modification script modify_gff3.sh
inserts mRNAs with the same size of its genes
creating proper Parent-Child relations.
I tried some techniques equipped in easy-import, but I failed to get my desired result.
I would appreciate if you could give me any comment about the preprocessing.
Thanks for putting int the effort in to make preprocessing! I really like the idea and this looks like a very flexible implementation. I had anticipated including one-liners as comments but having them run like this is a good step towards greater reproducibility. I be happy to take a pull request for this.
One thing that would be good to resolve is that as you are referencing a script, the commands in modify_gff3.sh
would be unknown to someone trying to recreate this from the ini file alone. Maybe just having the script as a Github gist with a note of its URL would be a good idea.
Only 14 warnings is good, presumably these are stop codons or non-ACGT bases that are just one character out in the file versus the database.
For the particular problem of missing mRNAs, the line:
_CONDITION_9 = [ EXPECTATION exon hasParent transcript|mrna|mirna|trna|ncrna|rrna force ]
in the default.ini
file should be creating parent transcripts. I'm not able to test this at the moment, but I wonder if you were having problems as any new rule you tried were being applied after this. If so, you should be able to create a _CONDITION_9
in your ini to overwrite this default that has mrna first (or only mrna) then the GFF processing should be able to fill in the missing mRNAs:
_CONDITION_9 = [ EXPECTATION exon hasParent mrna force ]
Hi @rjchallis,
Sorry, for the late reply. I am reporting the result of your advice.
This is the snippet of the original gff:
Fes_sc0000001.1 augustus_arabi gene 560 4505 0.05 + . ID=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 560 645 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.1;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 589 645 0.39 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.1;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 727 959 1 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.2;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 727 959 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.2;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 1078 1326 0.72 + 1 ID=CDS:Fes_sc0000001.1.g000001.aua.1.3;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 1078 1326 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.3;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 1414 1558 0.98 + 1 ID=CDS:Fes_sc0000001.1.g000001.aua.1.4;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 1414 1558 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.4;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 1692 1793 1 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.5;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 1692 1793 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.5;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 2046 2138 0.91 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.6;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 2046 2138 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.6;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 2211 2252 0.96 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.7;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 2211 2252 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.7;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 2379 2489 0.8 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.8;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 2379 2489 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.8;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 2605 2676 0.99 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.9;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 2605 2676 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.9;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 3172 3354 1 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.10;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 3172 3354 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.10;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 3856 4212 1 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.11;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 3856 4212 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.11;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 4295 4405 1 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.12;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 4295 4505 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.12;Parent=gene:Fes_sc0000001.1.g000001.aua.1
There is no mRNA in the file.
And this is the output of my modification script (modify_gff3.sh
):
Fes_sc0000001.1 augustus_arabi gene 560 4505 0.05 + . ID=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi mRNA 560 4505 0.05 + . ID=transcript:Fes_sc0000001.1.g000001.aua.1;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 560 645 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.1;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 589 645 0.39 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.1;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 727 959 1 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.2;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 727 959 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.2;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 1078 1326 0.72 + 1 ID=CDS:Fes_sc0000001.1.g000001.aua.1.3;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 1078 1326 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.3;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 1414 1558 0.98 + 1 ID=CDS:Fes_sc0000001.1.g000001.aua.1.4;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 1414 1558 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.4;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 1692 1793 1 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.5;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 1692 1793 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.5;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 2046 2138 0.91 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.6;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 2046 2138 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.6;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 2211 2252 0.96 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.7;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 2211 2252 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.7;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 2379 2489 0.8 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.8;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 2379 2489 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.8;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 2605 2676 0.99 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.9;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 2605 2676 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.9;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 3172 3354 1 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.10;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 3172 3354 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.10;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 3856 4212 1 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.11;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 3856 4212 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.11;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi CDS 4295 4405 1 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.12;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 augustus_arabi exon 4295 4505 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.12;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Here, a mRNA with the same size of gene was inserted and CDSs and eons belonged to it.
Finally, the result after setting the condition
_CONDITION_9 = [ EXPECTATION exon hasParent mrna force ]
in the ini file (fagopyrum_esculentum_fes1_core_1_1.ini
):
Fes_sc0000001.1 augustus_arabi gene 560 4505 0.05 + . ID=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1 GFFTree mrna 1078 1326 . + . ID=GFFTree_mrna670;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna670;translation_stable_id=GFFTree_mrna670
Fes_sc0000001.1 augustus_arabi CDS 1078 1326 0.72 + 1 ID=CDS:Fes_sc0000001.1.g000001.aua.1.3;Parent=GFFTree_mrna670
Fes_sc0000001.1 augustus_arabi exon 1078 1326 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.3;Parent=GFFTree_mrna670
Fes_sc0000001.1 GFFTree mrna 1414 1558 . + . ID=GFFTree_mrna671;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna671;translation_stable_id=GFFTree_mrna671
Fes_sc0000001.1 augustus_arabi CDS 1414 1558 0.98 + 1 ID=CDS:Fes_sc0000001.1.g000001.aua.1.4;Parent=GFFTree_mrna671
Fes_sc0000001.1 augustus_arabi exon 1414 1558 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.4;Parent=GFFTree_mrna671
Fes_sc0000001.1 GFFTree mrna 1692 1793 . + . ID=GFFTree_mrna672;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna672;translation_stable_id=GFFTree_mrna672
Fes_sc0000001.1 augustus_arabi CDS 1692 1793 1 + 2 ID=CDS:Fes_sc0000001.1.g000001.aua.1.5;Parent=GFFTree_mrna672
Fes_sc0000001.1 augustus_arabi exon 1692 1793 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.5;Parent=GFFTree_mrna672
Fes_sc0000001.1 GFFTree mrna 2046 2138 . + . ID=GFFTree_mrna673;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna673;translation_stable_id=GFFTree_mrna673
Fes_sc0000001.1 augustus_arabi CDS 2046 2138 0.91 + 2 ID=CDS:Fes_sc0000001.1.g000001.aua.1.6;Parent=GFFTree_mrna673
Fes_sc0000001.1 augustus_arabi exon 2046 2138 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.6;Parent=GFFTree_mrna673
Fes_sc0000001.1 GFFTree mrna 2211 2252 . + . ID=GFFTree_mrna674;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna674;translation_stable_id=GFFTree_mrna674
Fes_sc0000001.1 augustus_arabi CDS 2211 2252 0.96 + 1 ID=CDS:Fes_sc0000001.1.g000001.aua.1.7;Parent=GFFTree_mrna674
Fes_sc0000001.1 augustus_arabi exon 2211 2252 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.7;Parent=GFFTree_mrna674
Fes_sc0000001.1 GFFTree mrna 2379 2489 . + . ID=GFFTree_mrna675;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna675;translation_stable_id=GFFTree_mrna675
Fes_sc0000001.1 augustus_arabi CDS 2379 2489 0.8 + 1 ID=CDS:Fes_sc0000001.1.g000001.aua.1.8;Parent=GFFTree_mrna675
Fes_sc0000001.1 augustus_arabi exon 2379 2489 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.8;Parent=GFFTree_mrna675
Fes_sc0000001.1 GFFTree mrna 2605 2676 . + . ID=GFFTree_mrna676;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna676;translation_stable_id=GFFTree_mrna676
Fes_sc0000001.1 augustus_arabi CDS 2605 2676 0.99 + 1 ID=CDS:Fes_sc0000001.1.g000001.aua.1.9;Parent=GFFTree_mrna676
Fes_sc0000001.1 augustus_arabi exon 2605 2676 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.9;Parent=GFFTree_mrna676
Fes_sc0000001.1 GFFTree mrna 3172 3354 . + . ID=GFFTree_mrna677;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna677;translation_stable_id=GFFTree_mrna677
Fes_sc0000001.1 augustus_arabi CDS 3172 3354 1 + 2 ID=CDS:Fes_sc0000001.1.g000001.aua.1.10;Parent=GFFTree_mrna677
Fes_sc0000001.1 augustus_arabi exon 3172 3354 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.10;Parent=GFFTree_mrna677
Fes_sc0000001.1 GFFTree mrna 3856 4212 . + . ID=GFFTree_mrna678;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna678;translation_stable_id=GFFTree_mrna678
Fes_sc0000001.1 augustus_arabi CDS 3856 4212 1 + 2 ID=CDS:Fes_sc0000001.1.g000001.aua.1.11;Parent=GFFTree_mrna678
Fes_sc0000001.1 augustus_arabi exon 3856 4212 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.11;Parent=GFFTree_mrna678
Fes_sc0000001.1 GFFTree mrna 4295 4405 . + . ID=GFFTree_mrna679;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna679;translation_stable_id=GFFTree_mrna679
Fes_sc0000001.1 augustus_arabi CDS 4296 4405 1 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.12;Parent=GFFTree_mrna679
Fes_sc0000001.1 GFFTree mrna 589 645 . + . ID=GFFTree_mrna680;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna680;translation_stable_id=GFFTree_mrna680
Fes_sc0000001.1 augustus_arabi CDS 590 645 0.39 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.1;Parent=GFFTree_mrna680
Fes_sc0000001.1 GFFTree mrna 727 959 . + . ID=GFFTree_mrna681;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna681;translation_stable_id=GFFTree_mrna681
Fes_sc0000001.1 augustus_arabi CDS 727 959 1 + 0 ID=CDS:Fes_sc0000001.1.g000001.aua.1.2;Parent=GFFTree_mrna681
Fes_sc0000001.1 augustus_arabi exon 727 959 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.2;Parent=GFFTree_mrna681
Fes_sc0000001.1 GFFTree mrna 4295 4505 . + . ID=GFFTree_mrna682;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna682;translation_stable_id=GFFTree_mrna682
Fes_sc0000001.1 augustus_arabi exon 4295 4505 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.12;Parent=GFFTree_mrna682
Fes_sc0000001.1 GFFTree mrna 560 645 . + . ID=GFFTree_mrna683;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna683;translation_stable_id=GFFTree_mrna683
Fes_sc0000001.1 augustus_arabi exon 560 645 . + . ID=exon:Fes_sc0000001.1.g000001.aua.1.1;Parent=GFFTree_mrna683
###
Here, mRNA was created for each exon, which creates a lot of WARNINGS in the validation process.
Removing _CONDITION_9
and using only default.ini
gave, basically, the same result.
So, I will go with my script for a while. Sorry but thank you for your advice.
As for the preprocessing mechanism, I am glad you like it. I have created a pull request.
But before merging the code, we have to consider one thing.
I have inserted the preprocessing code only in easy-import/core/prepare_gff.pl
.
So some user could skip the preparation process and import the data directly.
Or some user could try to execute the script multiple times, which destroy the gff.
Do you have any idea about about avoiding those happen?
Thanks for the detailed report, sorry the built in methods didn't work but good that it can be solved by preprocessing.
I don't think the first you raised will be too much of a problem in practice. Skipping the prepare step also means that the built in methods would not be applied so the preprocessing step is no different here. If the GFF needs fixing it should generate enough warnings/errors to indicate something is wrong.
As for running the script multiple times, this could be a problem and I'm not sure of the best solution. I like that preprocessing is generic so it could be used to update sequence file headers as well/instead of gff, but that complicates finding a way to test if the file has been changed. One idea would be to calculate checksums for each GFF/FASTA infile before preprocessing and write them to disk after preprocessing:
Of course having just typed the checksum version I'm realising that this could be simplified to just touch a file to flag that preprocessing has been done.
I like the idea of checksum.
In that case, we'd better to change the format of ini
, like,
[FILES]
SCAFFOLD = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genome.fa.gz ]
GFF = [ gff ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz ]
PROTEIN = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.pep.fa.gz ]
[PREPROCESSING]
GFF = [ /import/src/fagopyrum_esculentum_fes1_core_1_1/modify_gff3.sh FES_r1.0.genes.gff3 > temp.gff3; mv temp.gff3 FES_r1.0.genes.gff3 ]
SCAFFOLD = [ ...(Scaffold modification commands)... ]
Since this format clarifies the target file for each preprocess, it is easier for the program to determine which checksum should be checked before executing the command.
Or you could represent the target file by {}
[FILES]
SCAFFOLD = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genome.fa.gz ]
GFF = [ gff ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz ]
PROTEIN = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.pep.fa.gz ]
[PREPROCESSING]
GFF = [ /import/src/fagopyrum_esculentum_fes1_core_1_1/modify_gff3.sh {} > temp.gff3; mv temp.gff3 {} ]
SCAFFOLD = [ ...(Scaffold modification commands)... ]
Thanks, reformatting the ini
makes sense to associate files with appropriate checksums.
I prefer your second suggestion using {}
to represent the file.
OK. I'll think about the implementation and will renew the pull request. But it could take a few weeks because I'm kind of busy now...
First of all, let me say thank you for sharing the great project!
But I'm having a hard time importing a new genome. I found my gff needs to be modified before the import. So I added some lines in the ini file.
Here, the script for gff modification (/import/src/modify_gff3.sh) was as follows:
I checked the modified gff by this script was imported correctly.
The script was shared with the container by the docker command option:
But the command output many warnings and the script didn't seem to be working. Am I doing wrong?
Any comment would be appreciated. Thank you.