Open katewarner opened 5 days ago
@rykahsay it looks like the Arabidopsis refseq reports contain references, so I'm wondering how the *_protein_citations_refseq.csv files are created. Does the script exclude certain references, such as submission or genome references? Could I have the location or a copy of your dataset script so that I can take a look?
arabidopsis_protein_citations_refseq.csv is created from arabidopsis_protein_function_refseq.csv. So what we need to address is why arabidopsis_protein_function_refseq.csv is empty.
The "function" text is extracted from the reports from the following sections/conditions:
What you need to check in the Arabidopsis refseq reports is whether any record satisfies one of the above two conditions. If not, neither dataset should be created (both should be removed from the masterlist).
I couldn't find any Arabidopsis reports that satisfied either of those two conditions. Other plant species seem to have remark sections containing the word GeneRIF, but Arabidopsis does not.
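For anyone who wants to repeat that check, here is a minimal sketch of a keyword scan over a folder of reports. It assumes the reports are plain-text files in a local directory, and it only looks for the word GeneRIF mentioned above; the exact extraction conditions used by the dataset script are not reproduced here. The function name `reports_mentioning` and its parameters are hypothetical, not part of the project's code.

```python
from pathlib import Path


def reports_mentioning(report_dir, keyword="GeneRIF", pattern="*"):
    """Return sorted paths of files under report_dir whose text contains keyword.

    Assumption: reports are plain-text files; the real report layout may differ.
    """
    hits = []
    for path in sorted(Path(report_dir).rglob(pattern)):
        if not path.is_file():
            continue
        # errors="ignore" so a stray non-UTF-8 byte does not abort the scan
        with path.open(encoding="utf-8", errors="ignore") as handle:
            if any(keyword in line for line in handle):
                hits.append(path)
    return hits
```

An empty list for the Arabidopsis folder and a non-empty list for another plant species would be consistent with the observation above, though a plain substring match cannot tell whether the word appears inside a remark section.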
I've removed them from the dataset-masterlist.json objects. Please let me know if I can close this ticket or if there is anything else I need to do.
The following processed datasets are empty. Check if the original refseq reports for Arabidopsis contain this information.
arabidopsis_protein_citations_refseq.csv
arabidopsis_protein_function_refseq.csv
If not, update dataset-masterlist.json
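A quick way to confirm which processed datasets are empty is sketched below. It assumes each CSV starts with a single header row, so a file counts as empty when it has no rows beyond the header; `csv_is_empty` is a hypothetical helper, not an existing project function.

```python
import csv


def csv_is_empty(path, has_header=True):
    """Return True if the CSV at path has no data rows.

    Assumption: at most one header row; blank lines are not counted as data.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        rows = [row for row in csv.reader(handle)
                if any(cell.strip() for cell in row)]
    return len(rows) <= (1 if has_header else 0)
```

Running it on the two files listed above would show whether they contain only a header (or nothing at all) before deciding to edit dataset-masterlist.json.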