glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

Problems with Arabidopsis refseq datasets #1512

Open katewarner opened 5 days ago

katewarner commented 5 days ago

The following processed datasets are empty. Check if the original refseq reports for Arabidopsis contain this information.

arabidopsis_protein_citations_refseq.csv
arabidopsis_protein_function_refseq.csv

If not, update dataset-masterlist.json

katewarner commented 5 days ago

@rykahsay it looks like the Arabidopsis refseq reports do contain references, so I'm wondering how the *_protein_citations_refseq.csv files are created. Does the script exclude certain references, such as submission or genome references? Could I have the location or a copy of your dataset script so that I can take a look?

rykahsay commented 5 days ago

arabidopsis_protein_citations_refseq.csv is created from arabidopsis_protein_function_refseq.csv. So what we need to address is why arabidopsis_protein_function_refseq.csv is empty.

The "function" text is extracted from a report when either of the following conditions holds:

  1. in the "COMMENT" section there is text that contains the word "Summary:"
  2. in the "REMARK" section there is text that contains the word "GeneRIF:"

What you need to check in the Arabidopsis refseq reports is whether any record satisfies at least one of the two conditions above. If none does, neither dataset should be created (both should be removed from the masterlist).
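A minimal sketch of that check, not the actual dataset script. It assumes the reports are GenBank-style flat files: records terminated by `//`, keywords within the first 12 columns, and `REMARK` appearing as a sub-keyword under `REFERENCE`. The function names are hypothetical.

```python
from __future__ import annotations

from typing import Iterator


def iter_records(text: str) -> Iterator[list[str]]:
    """Yield each record as a list of lines (assumes '//' ends a record)."""
    record: list[str] = []
    for line in text.splitlines():
        if line.strip() == "//":
            if record:
                yield record
            record = []
        else:
            record.append(line)
    if record:
        yield record


def has_function_text(record: list[str]) -> bool:
    """True if COMMENT contains 'Summary:' or a REMARK contains 'GeneRIF:'."""
    section = ""
    for line in record:
        head = line[:12].strip()
        if head:
            # A non-blank keyword column starts a new section;
            # blank keyword columns are continuation lines.
            section = head.split()[0]
        body = line[12:]
        if section == "COMMENT" and "Summary:" in body:
            return True
        if section == "REMARK" and "GeneRIF:" in body:
            return True
    return False


def count_function_records(text: str) -> int:
    """Count records that satisfy at least one of the two conditions."""
    return sum(1 for record in iter_records(text) if has_function_text(record))
```

If `count_function_records` returns 0 for the full text of every Arabidopsis report, both datasets would be expected to come out empty.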

katewarner commented 4 days ago

I couldn't find any Arabidopsis reports that satisfy either of those two conditions. Other plant species seem to have REMARK sections containing the word "GeneRIF:", but Arabidopsis does not.

I've removed both datasets from the dataset-masterlist.json objects. Please let me know if I can close this ticket or if there is anything else I need to do.