duplicates in quick fasta dump

bradfordcondon commented 6 years ago

Bug description

Data check error at line number 183.
Query sequence data contain duplicate entry (Gleditsia_triacanthos_082614_comp10193_c0_seq2).

This occurs when submitting the KEGG GhostKOALA job.

next steps

determine if dupe is in db or not
write script to remove dupes from FASTA file?
modify quick FASTA to not print/alert on duplicates
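For the first two steps, a minimal sketch of how duplicate headers could be located in the dumped file. It assumes the ID is the first whitespace-delimited token after ">"; the function name and the toy records are made up for illustration and are not part of quick FASTA.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_duplicate_ids(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each repeated FASTA ID to the 1-based line numbers of its headers."""
    seen: Dict[str, List[int]] = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            # Assumption: the ID is the first token after '>'.
            tokens = line[1:].split()
            seq_id = tokens[0] if tokens else ""
            seen[seq_id].append(line_number)
    return {seq_id: nums for seq_id, nums in seen.items() if len(nums) > 1}


# Toy input; with a real file, pass the open file handle instead of a list.
example = [
    ">seq_a", "MLSRHNAIL",
    ">seq_b", "MKV",
    ">seq_a", "MLANYVPVYV",
]
duplicates = find_duplicate_ids(example)
print(duplicates)
```

Reporting line numbers makes it easy to compare against the line number in the data check error.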

bradfordcondon commented 6 years ago

select * from chado.feature where name = 'Gleditsia_triacanthos_082614_comp10193_c0_seq2';
749468      15  Gleditsia_triacanthos_082614_comp10193_c0_seq2  Gleditsia_triacanthos_082614_comp10193_c0_seq2  AAACTGTTTCCCGTGTAAAAATGAAGCAGCCACAAGATTGAGGCTCTATTTTCTTTGAAAAGAATGATGAATGAAACTAAGAATATGGTAGAATATTCAGTTCCATGTTCTCGTAGTATTCATAAAATTCTACTAAAAACATGTAGGCACCAACTGCTTCAACTTACAATGATACGATGCTCGCAAATTATGTGCCAGTATACGTGATGCTCCCACTAGGAGTTGTTACAATTGATAATGCCTTAGAAGACAGAGATGGGCTTGAGAAACAGCTCAAAGAGCTACGAGCAGCAGGTGTTGATGGGGTTATGGTTGATGTCTGGTGGGGTATTGTAGAATCCAGGGGGCCTAAGCAGTATGATTGGTCCGCTTATAGGAGATTGTTTCAACTAGTTCAAGAATGTGGATTGAAGTTACAAGCTATAATGTCATTCCACCAATGTGGAGGAAATGTGGGGGATTCTGTTTTCATCCCTCTACCTGAATGGGTACTTGCAATTGGAGAATCAAACACTGATATCTTTTACACCGATCGCACAGGTAACAGGAACAAGGAATATCTCACTCTTGGTGTGGACAACAAGCCTCTATTTCATGGTCGAACAGCCATTGAGCTATATAGTGACTATATGAAGAGCTTCAGAGAGAATATGGCAGATTTTTTGGAATCTGAACTCATGATTGACATTGAAGTAGGGCTTGGTCCTGCGGGAGAACTCAGATACCCCTCTTATACAGAAAGTCAGGGATGGAAATTTCCTGGTATTGGAAAATTTCAGTGCTATGACAAATATCTTAGAGCTGATTTCAAAGAGGCTGTGGCAAGAGCAGACCATTTTGAATGGGAGCTTCCAGATAATGCAGGGGAATACAATAGCAAACCAGAATCTACAGATTTTTTCAGATCAAATGGGACTTACCTGACAGAGAAAGGGAAGTTTTTCCTGACATGGTATTCCAACAAGTTGCTGAACCATGGTGATGAGATCCTGGATGAAGCCAACAAAGCGTTCCTAGGTGGCTGGAATCCACTGGTGGTACAAAACAGAAAATCATGCAGCAGAACTCAATTCAGGATATTACAATCTAAATGATAGAGATGGATACCGTCCTATAGCAAGGATGCTCTCTCGGCATAATGCGATATTGAATTTTACATGTCTTGAGATGAGGAACTCTGAACAAAATGCTGGGGCAAAAAGTGGTGCTCAGGAACTTGTTCAGCAGGTCCTGAGTGGAGGATGGAGAGAGAACCTTGAAGTTGCAGGAGAAAATGCACTTTCAAGGTATGACAGTGCAGGTTATAACCAGATTCTTCTAAATGCTAGACCAAATGGTGTTAATAGAAGGGGCCCTCCAAAGCTAAGGATGTATGGCGTGACATACTTGCGTTTGTCAGATGAGTTACTGCAGAAATCAAATTTTTACACATTCAAAACCTTTGTGAGGAAGATGCATGCTGATCTGGATTACTGTCCAGACCCAGAAAAGTACAATCATTACACAGTCCCCATGGATCGGTCAAAGCCCAGAATTCCAGTGGAGGTTCTTCTTGAAGCAACCAAACCAGTGGAGCCATTTCCATGGGATAAACAGACAGATATGAGCGTTGGCAGTGCACTTACAGATCTTTTAGGAAAACTTTTTTCTATATGCCTGCTAGACCAGAAATAAATGGGTAAAAATGGCACCAAAAAAAAAACATAAACTTAATGAAACGGAATTACTGTTAGACACATAAATGAAG 1747    a99438fb616103f7209a5afc1ef86e11    101394  f   f   2015-04-14 21:46:56.997911  2015-04-14 21:46:56.997911

There is only a single mRNA. However, there might be multiple proteins linked to this mRNA.

>Gleditsia_triacanthos_082614_comp10193_c0_seq2
MLSRHNAILNFTCLEMRNSEQNAGAKSGAQELVQQVLSGGWRENLEVAGENALSRYDSAGYNQILLNARPNGVNRRGPPKLRMYGVTYLRLSDELLQKSNFYTFKTFVRKMHADLDYCPDPEKYNHYTVPMDRSKPRIPVEVLLEATKPVEPFPWDKQTDMSVGSALTDLLGKLFSICLLDQK
>Gleditsia_triacanthos_082614_comp10193_c0_seq2
MLANYVPVYVMLPLGVVTIDNALEDRDGLEKQLKELRAAGVDGVMVDVWWGIVESRGPKQYDWSAYRRLFQLVQECGLKLQAIMSFHQCGGNVGDSVFIPLPEWVLAIGESNTDIFYTDRTGNRNKEYLTLGVDNKPLFHGRTAIELYSDYMKSFRENMADFLESELMIDIEVGLGPAGELRYPSYTESQGWKFPGIGKFQCYDKYLRADFKEAVARADHFEWELPDNAGEYNSKPESTDFFRSNGTYLTEKGKFFLTWYSNKLLNHGDEILDEANKAFLGGWNPLVVQNRKSCSRTQFRILQSK

bradfordcondon commented 6 years ago

select * from chado.feature where residues='MLSRHNAILNFTCLEMRNSEQNAGAKSGAQELVQQVLSGGWRENLEVAGENALSRYDSAGYNQILLNARPNGVNRRGPPKLRMYGVTYLRLSDELLQKSNFYTFKTFVRKMHADLDYCPDPEKYNHYTVPMDRSKPRIPVEVLLEATKPVEPFPWDKQTDMSVGSALTDLLGKLFSICLLDQK';
1537317     15  Gleditsia_triacanthos_082614_comp10193_c0_seq2_m.4058   Gleditsia_triacanthos_082614_comp10193_c0_seq2_m.4058   MLSRHNAILNFTCLEMRNSEQNAGAKSGAQELVQQVLSGGWRENLEVAGENALSRYDSAGYNQILLNARPNGVNRRGPPKLRMYGVTYLRLSDELLQKSNFYTFKTFVRKMHADLDYCPDPEKYNHYTVPMDRSKPRIPVEVLLEATKPVEPFPWDKQTDMSVGSALTDLLGKLFSICLLDQK 183 5d574c82f2e8357c028c97cf62ee82b1    236 f   f   2015-06-23 18:29:16.791567  2015-06-23 18:29:16.791567

and

select * from chado.feature where residues='MLANYVPVYVMLPLGVVTIDNALEDRDGLEKQLKELRAAGVDGVMVDVWWGIVESRGPKQYDWSAYRRLFQLVQECGLKLQAIMSFHQCGGNVGDSVFIPLPEWVLAIGESNTDIFYTDRTGNRNKEYLTLGVDNKPLFHGRTAIELYSDYMKSFRENMADFLESELMIDIEVGLGPAGELRYPSYTESQGWKFPGIGKFQCYDKYLRADFKEAVARADHFEWELPDNAGEYNSKPESTDFFRSNGTYLTEKGKFFLTWYSNKLLNHGDEILDEANKAFLGGWNPLVVQNRKSCSRTQFRILQSK';
1537318     15  Gleditsia_triacanthos_082614_comp10193_c0_seq2_m.4057   Gleditsia_triacanthos_082614_comp10193_c0_seq2_m.4057   MLANYVPVYVMLPLGVVTIDNALEDRDGLEKQLKELRAAGVDGVMVDVWWGIVESRGPKQYDWSAYRRLFQLVQECGLKLQAIMSFHQCGGNVGDSVFIPLPEWVLAIGESNTDIFYTDRTGNRNKEYLTLGVDNKPLFHGRTAIELYSDYMKSFRENMADFLESELMIDIEVGLGPAGELRYPSYTESQGWKFPGIGKFQCYDKYLRADFKEAVARADHFEWELPDNAGEYNSKPESTDFFRSNGTYLTEKGKFFLTWYSNKLLNHGDEILDEANKAFLGGWNPLVVQNRKSCSRTQFRILQSK   305 89ecf2889d325851cd621c58f95b6a35    236 f   f   2015-06-23 18:29:16.791567  2015-06-23 18:29:16.791567

almasaeed2010 commented 6 years ago

IMO, the best way to do this and avoid memory leaks is to create a temp table and populate it with all the mRNA IDs that we want to print, while ensuring that we don't insert dupes. We can then join with the feature table and print chunks at a time.
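A rough sketch of the temp-table idea, using Python's sqlite3 with an in-memory database as a stand-in for the real PostgreSQL/Chado setup. The two toy tables, their contents, and the names mrna_to_print and iter_fasta_records are assumptions for illustration only; on PostgreSQL the dedup step would more likely be SELECT DISTINCT or INSERT ... ON CONFLICT DO NOTHING.

```python
import sqlite3
from typing import Iterator

demo_db = sqlite3.connect(":memory:")
demo_db.executescript("""
    -- Toy stand-ins for chado.feature and chado.feature_relationship.
    CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT, residues TEXT);
    CREATE TABLE feature_relationship (subject_id INTEGER, object_id INTEGER);
    INSERT INTO feature VALUES
        (1, 'mrna_1', 'ATGGCC'), (2, 'mrna_2', 'ATGTTT'),
        (3, 'prot_1a', 'MA'), (4, 'prot_1b', 'MP'), (5, 'prot_2', 'MF');
    -- Two proteins (3 and 4) point at the same parent mRNA (1).
    INSERT INTO feature_relationship VALUES (3, 1), (4, 1), (5, 2);
    -- The primary key plus OR IGNORE keeps each parent ID only once.
    CREATE TEMP TABLE mrna_to_print (feature_id INTEGER PRIMARY KEY);
    INSERT OR IGNORE INTO mrna_to_print (feature_id)
        SELECT object_id FROM feature_relationship;
""")


def iter_fasta_records(conn: sqlite3.Connection, chunk_size: int = 2) -> Iterator[str]:
    """Yield FASTA records, fetching only chunk_size rows at a time."""
    cursor = conn.execute("""
        SELECT f.name, f.residues
        FROM mrna_to_print t
        JOIN feature f ON f.feature_id = t.feature_id
        ORDER BY f.feature_id
    """)
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            return
        for name, residues in chunk:
            yield f">{name}\n{residues}"


for record in iter_fasta_records(demo_db, chunk_size=1):
    print(record)
```

Because the function is a generator, only one chunk of rows is held in memory at any time, which is the point of printing in chunks.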

bradfordcondon commented 6 years ago

OK. I think the problem is that when G. triacanthos was loaded, a bad regexp was used to link the proteins to the parent mRNA. We need to determine which is the "real" protein (easy: just translate the mRNA). But then, what do we do with the mis-loaded protein?
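A sketch of what "just translate the mRNA" could look like without any dependencies. It assumes the standard genetic code and checks all six reading frames; the helper names and the short test sequence are made up for illustration (Biopython's translation would do the same job).

```python
from itertools import product
from typing import List

# Standard genetic code in NCBI table 1 order (bases T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def frames_encoding(mrna: str, protein: str) -> List[str]:
    """Return the frames (+1..+3, -1..-3) whose translation contains the protein."""
    target = protein.rstrip("*")
    forward = mrna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    hits = []
    for sign, strand in (("+", forward), ("-", reverse)):
        for offset in range(3):
            if target in translate(strand[offset:]):
                hits.append(f"{sign}{offset + 1}")
    return hits


# Toy example: the ORF starts at the third base, so it sits in frame +3.
toy_mrna = "CCATGCTCTCTCGGCATAATTAA"
print(frames_encoding(toy_mrna, "MLSRHN"))
```

If both proteins turn up in some frame of the same mRNA, neither one is obviously mis-loaded.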

bradfordcondon commented 6 years ago

So after investigation, this isn't a bug: the duplicates are probably alternate splices of the same gene. Or put another way: the proteins do indeed have a relationship with the same mRNA.

Leaving the issue open in case we need to deal with the problem to get it loaded.

bradfordcondon commented 6 years ago

Related problem: IPS creates feature names that are not in the DB. What happens?

Acer_saccharum_022416_comp50491_c0_seq2_2 throws an error in the IPS loader, but Acer_saccharum_022416_comp50491_c0_seq2_2 was not in the input FASTA file generated by quick fasta. Rather, Acer_saccharum_022416_comp50491_c0_seq2 is. Why? Because there are two proteins for this mRNA. In theory we want BOTH sets of annotations. For now we are ignoring the second protein.

almasaeed2010 commented 6 years ago

A new error showed up:

feature and is being skipped.
WD tr_ipr_parse: Ambiguous: 'Haimp10017244m' matches more than one   [warning]
feature and is being skipped.
WD tr_ipr_parse: Ambiguous: 'Haimp10017472m' matches more than one   [warning]
feature and is being skipped.

I think both proteins and mRNAs are named the same for this organism, so we have to specify the type to match against or change the names of the proteins.

almasaeed2010 commented 6 years ago

Also note that all 309 files returned duplicate-entry errors, so we probably have duplicate parents for all of our organisms.

bradfordcondon commented 6 years ago

"I think both proteins and mRNAs are named the same for this organism, so we have to specify the type to match against or change the names of the proteins."

correct. I would advocate annotating this organism separately so we can specify the type.

almasaeed2010 commented 6 years ago

Same warning appeared for the following:

WD tr_ipr_parse: Ambiguous:                                          [warning]
'snap_masked-scaffold03047-abinit-gene-0.11-mRNA-1' matches more than
one feature and is being skipped.
WD tr_ipr_parse: Ambiguous:                                          [warning]
'augustus_masked-scaffold05692-abinit-gene-0.4-mRNA-1' matches more
than one feature and is being skipped.
WD tr_ipr_parse: Ambiguous:                                          [warning]
'maker-scaffold00218-augustus-gene-0.35-mRNA-1' matches more than one
feature and is being skipped.

bradfordcondon commented 6 years ago

https://github.com/bradfordcondon/simple_biopython/blob/master/staton_biopy/concat_duplicates.py

A simple Python script to concatenate multiple proteins that point at the same parent.
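For readers who don't want to open the link, here is a rough sketch of the idea. This is not the linked script, just a dependency-free illustration under the assumption that records sharing the same header ID are joined in order of appearance; parse_fasta and concat_duplicates are hypothetical names.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (id, sequence) pairs; the id is the first token after '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            tokens = line[1:].split()
            header, chunks = (tokens[0] if tokens else ""), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def concat_duplicates(records: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Join the sequences of records that share an ID, keeping first-seen order."""
    merged: Dict[str, List[str]] = {}
    for seq_id, sequence in records:
        merged.setdefault(seq_id, []).append(sequence)
    return {seq_id: "".join(parts) for seq_id, parts in merged.items()}


toy_fasta = [
    ">parent_1", "MLSR", "HNA",
    ">parent_2", "MKV",
    ">parent_1", "MLANY",
]
merged_records = concat_duplicates(parse_fasta(toy_fasta))
for seq_id, sequence in merged_records.items():
    print(f">{seq_id}\n{sequence}")
```

After this step every ID appears exactly once, which is what the duplicate-entry check needs.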

Potential problem: the coordinates will now be screwed up. Counterpoint: I think they would have been screwed up anyway, since the locations correspond to the protein and not the mRNA location.

Additionally, the non-unique feature names (HAIMP, chestnut) will need to be run individually so that a type can be specified.

almasaeed2010 commented 6 years ago

Sorry to say this but looking at the log I found another organism that will need to be loaded separately: Alnus rubra

'Alnus_rubra_021816_comp10639_c0_seq6_2' in the database.
WD tr_ipr_parse: Failed: cannot find a matching feature for              [error]
'Alnus_rubra_021816_comp10639_c0_seq10_1' in the database.
WD tr_ipr_parse: Failed: cannot find a matching feature for              [error]

almasaeed2010 commented 6 years ago

We need a query that will let us know which organisms don't have unique names.
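One possible shape for that query, sketched against toy tables in an in-memory SQLite database. The column names (organism_id, name, genus, species) follow the Chado feature and organism tables, but the data, the placeholder organism names, and the variable names are made up; on the real site the tables would be chado.feature and chado.organism.

```python
import sqlite3

toy_db = sqlite3.connect(":memory:")
toy_db.executescript("""
    CREATE TABLE organism (organism_id INTEGER PRIMARY KEY, genus TEXT, species TEXT);
    CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, organism_id INTEGER,
                          name TEXT, type_id INTEGER);
    INSERT INTO organism VALUES (1, 'Genus_a', 'species_a'), (2, 'Genus_b', 'species_b');
    -- Organism 1 reuses one name for two feature types; organism 2 does not.
    INSERT INTO feature VALUES
        (1, 1, 'gene_x', 10), (2, 1, 'gene_x', 20),
        (3, 2, 'gene_y', 10), (4, 2, 'gene_y_protein', 20);
""")

# Per organism, count the feature names that occur more than once.
NON_UNIQUE_NAMES_SQL = """
    SELECT o.genus, o.species, COUNT(*) AS ambiguous_names
    FROM (
        SELECT organism_id, name
        FROM feature
        GROUP BY organism_id, name
        HAVING COUNT(*) > 1
    ) d
    JOIN organism o ON o.organism_id = d.organism_id
    GROUP BY o.organism_id, o.genus, o.species
    ORDER BY o.genus, o.species
"""

rows = toy_db.execute(NON_UNIQUE_NAMES_SQL).fetchall()
print(rows)
```

Any organism that shows up in the result would need to be loaded separately with a type specified.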

almasaeed2010 commented 6 years ago

New fasta dump jobs:

[x] mRNA
[x] mRNA_contig

Once these complete, we can find the files in

mRNA: /var/www/html/sites/default/files/proteins_from_145.fasta
mRNA_contig: /var/www/html/sites/default/files/proteins_from_101394.fasta

Steps after the files are created:

[x] Run the python script to remove duplicates
[x] Remove asterisks from the file: sed 's/\*//g' old_file_name > new_file_name
[x] Count sequences in each file and give @MattHuff the files to run IPS
[ ] Repeat for live
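The asterisk removal and the count could also be done in one pass. The sketch below is a Python alternative to the sed command, not what was actually run; unlike sed it leaves header lines untouched, and strip_asterisks is a hypothetical name.

```python
from typing import Iterable, List, Tuple


def strip_asterisks(lines: Iterable[str]) -> Tuple[List[str], int]:
    """Drop '*' from sequence lines only and count the header lines."""
    cleaned: List[str] = []
    record_count = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            record_count += 1
            cleaned.append(line)  # headers are kept as-is
        else:
            cleaned.append(line.replace("*", ""))
    return cleaned, record_count


toy_lines = [">seq_a\n", "MLSRHN*\n", ">seq_b\n", "MK*V*\n"]
cleaned_lines, n_records = strip_asterisks(toy_lines)
print(n_records)
print("\n".join(cleaned_lines))
```

Counting header lines rather than every ">" character avoids over-counting if a header ever contains a second ">".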

bradfordcondon commented 6 years ago

Jobs ran. I have downloaded the files, run the concatenator, and am running sed to remove asterisks. @MattHuff what's the most convenient way for me to get the files to you? I don't know if there's a place where we have shared permissions on ACF: what was Abdullah doing to get you the files?

almasaeed2010 commented 6 years ago

I gave him wget commands to make it easy. But I think he has ssh access, so you can just give him the path on the dev server.

bradfordcondon commented 6 years ago

OK, I use Biopython so I ran it locally; I'll re-upload to dev.

bradfordcondon commented 6 years ago

Files are at /var/www/html/sites/default/files/proteins_from_145_no_ast_cat.fasta and /var/www/html/sites/default/files/proteins_from_101394_no_ast_cat.fasta @MattHuff

bradfordcondon commented 6 years ago

how many sequences are in each file?

fgrep -o '>' /var/www/html/sites/default/files/proteins_from_145_no_ast_cat.fasta | wc -l
282114
fgrep -o '>' /var/www/html/sites/default/files/proteins_from_101394_no_ast_cat.fasta | wc -l
439436

matches with

select count(*) from chado.feature f inner join chado.cvterm c on c.cvterm_id = f.type_id where c.name = 'polypeptide';
782045

bradfordcondon commented 6 years ago

this is resolved.

statonlab / hardwoods_site

duplicates in quick fasta dump #360
