Closed bradfordcondon closed 6 years ago
select * from chado.feature where name = 'Gleditsia_triacanthos_082614_comp10193_c0_seq2';
749468 15 Gleditsia_triacanthos_082614_comp10193_c0_seq2 Gleditsia_triacanthos_082614_comp10193_c0_seq2 AAACTGTTTCCCGTGTAAAAATGAAGCAGCCACAAGATTGAGGCTCTATTTTCTTTGAAAAGAATGATGAATGAAACTAAGAATATGGTAGAATATTCAGTTCCATGTTCTCGTAGTATTCATAAAATTCTACTAAAAACATGTAGGCACCAACTGCTTCAACTTACAATGATACGATGCTCGCAAATTATGTGCCAGTATACGTGATGCTCCCACTAGGAGTTGTTACAATTGATAATGCCTTAGAAGACAGAGATGGGCTTGAGAAACAGCTCAAAGAGCTACGAGCAGCAGGTGTTGATGGGGTTATGGTTGATGTCTGGTGGGGTATTGTAGAATCCAGGGGGCCTAAGCAGTATGATTGGTCCGCTTATAGGAGATTGTTTCAACTAGTTCAAGAATGTGGATTGAAGTTACAAGCTATAATGTCATTCCACCAATGTGGAGGAAATGTGGGGGATTCTGTTTTCATCCCTCTACCTGAATGGGTACTTGCAATTGGAGAATCAAACACTGATATCTTTTACACCGATCGCACAGGTAACAGGAACAAGGAATATCTCACTCTTGGTGTGGACAACAAGCCTCTATTTCATGGTCGAACAGCCATTGAGCTATATAGTGACTATATGAAGAGCTTCAGAGAGAATATGGCAGATTTTTTGGAATCTGAACTCATGATTGACATTGAAGTAGGGCTTGGTCCTGCGGGAGAACTCAGATACCCCTCTTATACAGAAAGTCAGGGATGGAAATTTCCTGGTATTGGAAAATTTCAGTGCTATGACAAATATCTTAGAGCTGATTTCAAAGAGGCTGTGGCAAGAGCAGACCATTTTGAATGGGAGCTTCCAGATAATGCAGGGGAATACAATAGCAAACCAGAATCTACAGATTTTTTCAGATCAAATGGGACTTACCTGACAGAGAAAGGGAAGTTTTTCCTGACATGGTATTCCAACAAGTTGCTGAACCATGGTGATGAGATCCTGGATGAAGCCAACAAAGCGTTCCTAGGTGGCTGGAATCCACTGGTGGTACAAAACAGAAAATCATGCAGCAGAACTCAATTCAGGATATTACAATCTAAATGATAGAGATGGATACCGTCCTATAGCAAGGATGCTCTCTCGGCATAATGCGATATTGAATTTTACATGTCTTGAGATGAGGAACTCTGAACAAAATGCTGGGGCAAAAAGTGGTGCTCAGGAACTTGTTCAGCAGGTCCTGAGTGGAGGATGGAGAGAGAACCTTGAAGTTGCAGGAGAAAATGCACTTTCAAGGTATGACAGTGCAGGTTATAACCAGATTCTTCTAAATGCTAGACCAAATGGTGTTAATAGAAGGGGCCCTCCAAAGCTAAGGATGTATGGCGTGACATACTTGCGTTTGTCAGATGAGTTACTGCAGAAATCAAATTTTTACACATTCAAAACCTTTGTGAGGAAGATGCATGCTGATCTGGATTACTGTCCAGACCCAGAAAAGTACAATCATTACACAGTCCCCATGGATCGGTCAAAGCCCAGAATTCCAGTGGAGGTTCTTCTTGAAGCAACCAAACCAGTGGAGCCATTTCCATGGGATAAACAGACAGATATGAGCGTTGGCAGTGCACTTACAGATCTTTTAGGAAAACTTTTTTCTATATGCCTGCTAGACCAGAAATAAATGGGTAAAAATGGCACCAAAAAAAAAACATAAACTTAATGAAACGGAATTACTGTTAGACACATAAATGAAG 1747 a99438fb616103f7209a5afc1ef86e11 101394 f f 2015-04-14 21:46:56.997911 2015-04-14 21:46:56.997911
Only a single mRNA. However, there might be multiple proteins linked to this mRNA.
>Gleditsia_triacanthos_082614_comp10193_c0_seq2
MLSRHNAILNFTCLEMRNSEQNAGAKSGAQELVQQVLSGGWRENLEVAGENALSRYDSAGYNQILLNARPNGVNRRGPPKLRMYGVTYLRLSDELLQKSNFYTFKTFVRKMHADLDYCPDPEKYNHYTVPMDRSKPRIPVEVLLEATKPVEPFPWDKQTDMSVGSALTDLLGKLFSICLLDQK
>Gleditsia_triacanthos_082614_comp10193_c0_seq2
MLANYVPVYVMLPLGVVTIDNALEDRDGLEKQLKELRAAGVDGVMVDVWWGIVESRGPKQYDWSAYRRLFQLVQECGLKLQAIMSFHQCGGNVGDSVFIPLPEWVLAIGESNTDIFYTDRTGNRNKEYLTLGVDNKPLFHGRTAIELYSDYMKSFRENMADFLESELMIDIEVGLGPAGELRYPSYTESQGWKFPGIGKFQCYDKYLRADFKEAVARADHFEWELPDNAGEYNSKPESTDFFRSNGTYLTEKGKFFLTWYSNKLLNHGDEILDEANKAFLGGWNPLVVQNRKSCSRTQFRILQSK
select * from chado.feature where residues='MLSRHNAILNFTCLEMRNSEQNAGAKSGAQELVQQVLSGGWRENLEVAGENALSRYDSAGYNQILLNARPNGVNRRGPPKLRMYGVTYLRLSDELLQKSNFYTFKTFVRKMHADLDYCPDPEKYNHYTVPMDRSKPRIPVEVLLEATKPVEPFPWDKQTDMSVGSALTDLLGKLFSICLLDQK';
1537317 15 Gleditsia_triacanthos_082614_comp10193_c0_seq2_m.4058 Gleditsia_triacanthos_082614_comp10193_c0_seq2_m.4058 MLSRHNAILNFTCLEMRNSEQNAGAKSGAQELVQQVLSGGWRENLEVAGENALSRYDSAGYNQILLNARPNGVNRRGPPKLRMYGVTYLRLSDELLQKSNFYTFKTFVRKMHADLDYCPDPEKYNHYTVPMDRSKPRIPVEVLLEATKPVEPFPWDKQTDMSVGSALTDLLGKLFSICLLDQK 183 5d574c82f2e8357c028c97cf62ee82b1 236 f f 2015-06-23 18:29:16.791567 2015-06-23 18:29:16.791567
and
select * from chado.feature where residues='MLANYVPVYVMLPLGVVTIDNALEDRDGLEKQLKELRAAGVDGVMVDVWWGIVESRGPKQYDWSAYRRLFQLVQECGLKLQAIMSFHQCGGNVGDSVFIPLPEWVLAIGESNTDIFYTDRTGNRNKEYLTLGVDNKPLFHGRTAIELYSDYMKSFRENMADFLESELMIDIEVGLGPAGELRYPSYTESQGWKFPGIGKFQCYDKYLRADFKEAVARADHFEWELPDNAGEYNSKPESTDFFRSNGTYLTEKGKFFLTWYSNKLLNHGDEILDEANKAFLGGWNPLVVQNRKSCSRTQFRILQSK';
1537318 15 Gleditsia_triacanthos_082614_comp10193_c0_seq2_m.4057 Gleditsia_triacanthos_082614_comp10193_c0_seq2_m.4057 MLANYVPVYVMLPLGVVTIDNALEDRDGLEKQLKELRAAGVDGVMVDVWWGIVESRGPKQYDWSAYRRLFQLVQECGLKLQAIMSFHQCGGNVGDSVFIPLPEWVLAIGESNTDIFYTDRTGNRNKEYLTLGVDNKPLFHGRTAIELYSDYMKSFRENMADFLESELMIDIEVGLGPAGELRYPSYTESQGWKFPGIGKFQCYDKYLRADFKEAVARADHFEWELPDNAGEYNSKPESTDFFRSNGTYLTEKGKFFLTWYSNKLLNHGDEILDEANKAFLGGWNPLVVQNRKSCSRTQFRILQSK 305 89ecf2889d325851cd621c58f95b6a35 236 f f 2015-06-23 18:29:16.791567 2015-06-23 18:29:16.791567
IMO, the best way to do this and avoid memory leaks is to create a temp table and populate it with all the mRNA IDs that we want to print, while ensuring that we don't insert dupes. We can then join with the feature table and print a chunk at a time.
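A minimal sketch of that temp-table idea, assuming an in-memory SQLite database as a stand-in for the real PostgreSQL/Chado instance. The table name `wanted_mrna`, the function names, and the demo rows are hypothetical; only `feature`, `feature_id`, `name`, and `residues` come from the queries above.

```python
import sqlite3


def iter_fasta_chunks(conn, mrna_ids, chunk_size=1000):
    """Yield lists of FASTA records for the given mRNA ids, one chunk at a time."""
    cur = conn.cursor()
    # The primary key plus INSERT OR IGNORE keeps duplicate ids out of the temp table.
    cur.execute("CREATE TEMP TABLE wanted_mrna (feature_id INTEGER PRIMARY KEY)")
    cur.executemany(
        "INSERT OR IGNORE INTO wanted_mrna (feature_id) VALUES (?)",
        [(i,) for i in mrna_ids],
    )
    cur.execute(
        "SELECT f.name, f.residues FROM feature f "
        "JOIN wanted_mrna w ON w.feature_id = f.feature_id "
        "ORDER BY f.feature_id"
    )
    while True:
        rows = cur.fetchmany(chunk_size)  # fetch at most chunk_size rows per call
        if not rows:
            break
        yield [f">{name}\n{residues}" for name, residues in rows]
    cur.execute("DROP TABLE wanted_mrna")


def demo():
    # Hypothetical three-row feature table, just enough to exercise the join.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT, residues TEXT)"
    )
    conn.executemany(
        "INSERT INTO feature VALUES (?, ?, ?)",
        [(1, "mrna_a", "ATGAAA"), (2, "mrna_b", "ATGCCC"), (3, "mrna_c", "ATGGGG")],
    )
    # id 1 is requested twice but is emitted only once
    return list(iter_fasta_chunks(conn, [1, 1, 3], chunk_size=1))
```

In PostgreSQL the analogous dedupe-on-insert clause would be `INSERT ... ON CONFLICT DO NOTHING` rather than SQLite's `INSERT OR IGNORE`.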
OK. I think the problem is that when G. triacanthos was loaded, a bad regexp was used to link the proteins to the parent mRNA. We need to determine which is the "real" protein (easy, just translate the mRNA). But then, what do we do with the mis-loaded protein?
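A rough sketch of the "just translate the mRNA" check, assuming the standard genetic code and forward-strand reading frames only. The function name `forward_orfs` and the `min_len` parameter are hypothetical. It collects every peptide that starts at a methionine and runs to the next stop codon (or to the end of the sequence), so a stored protein can be tested for membership.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def forward_orfs(mrna, min_len=1):
    """Return the set of M-initiated peptides found in the three forward frames."""
    mrna = mrna.upper()
    orfs = set()
    for frame in range(3):
        codons = (mrna[i:i + 3] for i in range(frame, len(mrna) - 2, 3))
        # Unknown codons (for example those containing N) become 'X'.
        protein = "".join(CODON_TABLE.get(c, "X") for c in codons)
        for segment in protein.split("*"):
            start = segment.find("M")
            if start != -1 and len(segment) - start >= min_len:
                orfs.add(segment[start:])
    return orfs
```

For example, `"MLANYV" in forward_orfs("ATGCTCGCAAATTATGTGTAA")` is true; those 18 coding bases are taken from the mRNA above, with a stop codon appended for illustration. More than one stored protein can pass this check for the same mRNA.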
So after investigation, this isn't a bug: the duplicates are probably alternate splices of the same gene. Or put another way: the proteins do indeed have a relationship with the same mRNA.
Leaving the issue open in case we need to deal with the problem to get it loaded.
Related problem: IPS creates feature names that are not in the DB. What happens?
Acer_saccharum_022416_comp50491_c0_seq2_2 throws an error in the IPS loader, but Acer_saccharum_022416_comp50491_c0_seq2_2 was not in the input FASTA file generated by quick fasta. Rather, Acer_saccharum_022416_comp50491_c0_seq2 is. Why? Because there are two proteins for this mRNA. In theory we want BOTH sets of annotations. For now we are ignoring the second protein.
A new error showed up:
feature and is being skipped.
WD tr_ipr_parse: Ambiguous: 'Haimp10017244m' matches more than one [warning]
feature and is being skipped.
WD tr_ipr_parse: Ambiguous: 'Haimp10017472m' matches more than one [warning]
feature and is being skipped.
I think both proteins and mRNAs are named the same for this organism, so we have to specify the type to match against or change the names of the proteins.
Also note that all 309 files returned errors about duplicates, so we probably have duplicate parents for all of our organisms.
I think both proteins and mRNAs are named the same for this organism, so we have to specify the type to match against or change the names of the proteins.
Correct. I would advocate annotating this organism separately so we can specify the type.
Same warning appeared for the following:
WD tr_ipr_parse: Ambiguous: [warning]
'snap_masked-scaffold03047-abinit-gene-0.11-mRNA-1' matches more than
one feature and is being skipped.
WD tr_ipr_parse: Ambiguous: [warning]
'augustus_masked-scaffold05692-abinit-gene-0.4-mRNA-1' matches more
than one feature and is being skipped.
WD tr_ipr_parse: Ambiguous: [warning]
'maker-scaffold00218-augustus-gene-0.35-mRNA-1' matches more than one
feature and is being skipped.
https://github.com/bradfordcondon/simple_biopython/blob/master/staton_biopy/concat_duplicates.py
Simple Python script to concatenate multiple proteins pointed at the same parent.
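For readers without access to the linked script, here is a standard-library sketch of the same idea. It is not the linked Biopython script; it assumes that "concatenate" means joining the sequences of all records that share a header ID, in order of first appearance. The function name `concat_duplicates` and the sample records are illustrative.

```python
def concat_duplicates(fasta_text):
    """Merge FASTA records that share a header ID by concatenating their sequences."""
    merged = {}  # header ID -> list of sequence lines, in order of first appearance
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record ID.
            current = line[1:].split()[0]
            merged.setdefault(current, [])
        elif current is not None:
            merged[current].append(line)
    return "".join(
        ">" + name + "\n" + "".join(parts) + "\n" for name, parts in merged.items()
    )
```

Given `">a\nMK\n>b\nMQ\n>a\nMV\n"`, this returns `">a\nMKMV\n>b\nMQ\n"`: the two `a` records collapse into one.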
Potential problem: the coordinates will now be screwed up. Counterpoint: I think they would have been screwed up anyway, since the locations correspond to the protein and not the mRNA location.
Additionally, the non-unique feature names (HAIMP, chestnut) will need to be run individually so that a type can be specified.
Sorry to say this but looking at the log I found another organism that will need to be loaded separately: Alnus rubra
'Alnus_rubra_021816_comp10639_c0_seq6_2' in the database.
WD tr_ipr_parse: Failed: cannot find a matching feature for [error]
'Alnus_rubra_021816_comp10639_c0_seq10_1' in the database.
WD tr_ipr_parse: Failed: cannot find a matching feature for [error]
We need a query that will let us know which organisms don't have unique names.
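One possible shape for such a query, sketched against an in-memory SQLite stand-in. It assumes the Chado `feature` and `organism` tables with their `organism_id`, `name`, `genus`, and `species` columns; on the real database the tables would be schema-qualified as `chado.feature` and `chado.organism`. The organisms, names, and type ids in the demo are made up.

```python
import sqlite3

# Names used by more than one feature within the same organism, counted per organism.
NON_UNIQUE_NAMES_SQL = """
SELECT o.genus, o.species, COUNT(*) AS shared_names
FROM (
    SELECT organism_id, name
    FROM feature
    GROUP BY organism_id, name
    HAVING COUNT(*) > 1
) AS dup
JOIN organism AS o ON o.organism_id = dup.organism_id
GROUP BY o.organism_id, o.genus, o.species
ORDER BY o.genus, o.species
"""


def organisms_with_shared_names(conn):
    """Return (genus, species, number_of_shared_names) rows."""
    return conn.execute(NON_UNIQUE_NAMES_SQL).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE organism (organism_id INTEGER PRIMARY KEY, genus TEXT, species TEXT);
        CREATE TABLE feature (
            feature_id INTEGER PRIMARY KEY, organism_id INTEGER, name TEXT, type_id INTEGER
        );
        INSERT INTO organism VALUES (1, 'Examplus', 'alpha'), (2, 'Examplus', 'beta');
        -- Organism 1 reuses 'gene1' for two feature types; organism 2 does not.
        INSERT INTO feature VALUES
            (1, 1, 'gene1', 10), (2, 1, 'gene1', 20),
            (3, 2, 'gene2', 10), (4, 2, 'gene2_p', 20);
    """)
    return organisms_with_shared_names(conn)
```

The inner query could also be restricted to the mRNA and polypeptide `type_id` values if only that collision matters.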
New fasta dump jobs:
Once these complete, we can find the files in
/var/www/html/sites/default/files/proteins_from_145.fasta
/var/www/html/sites/default/files/proteins_from_101394.fasta
Steps after the files are created:
sed 's/\*//g' old_file_name > new_file_name
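A Python counterpart to the sed step, offered as a sketch for anyone already post-processing the dumps in Python. The function name `strip_stop_asterisks` is hypothetical. Unlike the sed command, which removes every asterisk in the file, this version touches only sequence lines and leaves headers alone; it also returns the record count for a later sanity check.

```python
def strip_stop_asterisks(fasta_text):
    """Remove '*' from sequence lines; return (cleaned_text, record_count)."""
    cleaned = []
    count = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            count += 1
            cleaned.append(line)  # header lines are left untouched
        else:
            cleaned.append(line.replace("*", ""))
    return "\n".join(cleaned) + "\n", count
```

For instance, `strip_stop_asterisks(">a\nMK*\n>b\nM*Q*\n")` returns `(">a\nMK\n>b\nMQ\n", 2)`.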
Jobs ran. I have downloaded the files, run the concatenator, and am running sed to remove asterisks. @MattHuff what's the most convenient way for me to get the files to you? I don't know if there's a place where we have shared permissions on ACF: what was Abdullah doing to get you the files?
I gave him wget commands to make it easy. But I think he has ssh access, so you can just give him the path on the dev server.
OK, I use Biopython so I ran it locally; I'll re-upload to dev.
files are at /var/www/html/sites/default/files/proteins_from_145_no_ast_cat.fasta
and /var/www/html/sites/default/files/proteins_from_101394_no_ast_cat.fasta
@MattHuff
how many sequences are in each file?
fgrep -o '>' /var/www/html/sites/default/files/proteins_from_145_no_ast_cat.fasta | wc -l
282114
fgrep -o '>' /var/www/html/sites/default/files/proteins_from_101394_no_ast_cat.fasta | wc -l
439436
matches with
select count(*) from chado.feature f inner join chado.cvterm c on c.cvterm_id = f.type_id where c.name = 'polypeptide';
782045
this is resolved.
Bug description
when submitting the kegg ghost koala job
next steps