glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

Create FlyBase Linkout File #863

Open jeet-vora opened 1 year ago

jeet-vora commented 1 year ago

https://wiki.flybase.org/wiki/FlyBase:Links_to_and_from_FlyBase#Links_from_FlyBase

Links from FlyBase

FlyBase supports linkouts from any FlyBase object that has a stable FlyBase ID (e.g. FBxx[0-9]+) and a web report. Databases suitable for this kind of linking to FlyBase are those with mature data structures whose data are expressed in terms of FlyBase genetic objects that carry stable identifiers or as sequences that can be mapped to the reference sequence of a Drosophila species. FlyBase currently accepts linkout data in a simple spreadsheet table (see below), plus a summary record for the external database with link information and name. We are happy to consider additional linkout databases. Please contact us if you would like to contribute links to your database.

FlyBase-curated links and linkouts are displayed on the Report Pages in the most appropriate section of the Report. Linkouts are indicated by a Linkout label in parentheses after the field label. In addition, on the Gene Report, all FlyBase-curated links and linkouts are also grouped together in a single External crossreferences & Linkouts section.

How to establish linkouts

[Contact us](http://flybase.org/contact/email) with a brief description of your database and links to your website. Please be sure to include links to your main site as well as the report pages that you would like us to link to.
Validate your FlyBase IDs using our [ID Converter tool](http://flybase.org/convert/id).
Construct your link table and database information file, making sure that you meet the guidelines set forth in Linkout Requirements.
Contact us to let us know that you have finished preparing your files and are ready to make a submission. You will receive an email with instructions on how to upload your files. Multiple files should be tar gzipped or zip compressed into a single file.
Update your links at least once a year from the time of your previous submission.
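The bundling step above (multiple files packed into a single tar gzipped file) can be sketched in Python with the standard `tarfile` module. This is only an illustration: the function name `bundle_submission` and the file names in the usage note are hypothetical, and the FlyBase upload instructions may ask for something different.

```python
import tarfile
from pathlib import Path


def bundle_submission(paths, archive_path):
    """Pack the given files into one gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in paths:
            path = Path(path)
            # Store each file under its bare name so the archive
            # contains no local directory prefixes.
            tar.add(str(path), arcname=path.name)
    return archive_path
```

For example, `bundle_submission(["GlyGen_linkout.txt", "GlyGen_dbinfo.txt"], "glygen_flybase.tar.gz")` would produce a single archive holding both (hypothetically named) files.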

Please note that if you are establishing a single type of linkout between FlyBase and your site then only a single linking table and database information file is required. If you want to establish multiple types of linkouts then you need to submit a linking table and database information file for each type.

Linkout requirements

The linkout link targets (the web reports that the URLs redirect to) must provide data that isn't available in the FlyBase report.
Linkout links can only be established for the subset of FlyBase objects that you have additional data for. Links cannot lead to an error page, a blank report or a report that provides no additional data about the FlyBase object that is being linked from.
Linkout data must be updated once a year. Linkout data that has not been updated in over a year will be dropped from FlyBase.
FlyBase IDs must be validated using our ID Converter tool to ensure that you are using current FlyBase IDs. Linkout links that refer to old FlyBase IDs will be automatically dropped.
FlyBase reserves the right to reject or remove linkouts if these requirements are not met.

Linkout Submission Format

Link table

The link table is a simple 4-column tab-delimited file. The columns are described in order below. The filename of this file must use the form

<DBNAME>_linkout.txt

Replace <DBNAME> with the value used in column 2 of the same file.

Column 1 - FlyBase ID
A valid FlyBase ID matching this regular expression: '^FB\w\w\d+\t'

Column 2 - DBNAME
Some unique/standard name for external database. Alpha-numeric only 'A-z0-9'. If you are submitting more than one linking table you need to ensure that the DBNAME is unique to each file. Reusing a DbName once it is used in another linking table is not permitted. For example, if a group named "FLYLAB" wanted to establish links between FlyBase gene reports and 2 different types of analysis on their web site they could use "FLYLAB_EX1" and "FLYLAB_EX2" for the DbName column in their linkout files.

Column 3 - DBID
External database object id. This field can either be an ID or a short phrase (e.g. name of a pathway/reaction). Spaces are allowed, but tabs are not. This field cannot exceed 255 characters.

Column 4 - DBURL
Relative link to external database web report. This is the text that will be appended to the base URL parameter that is defined in the [database information file](https://wiki.flybase.org/wiki/FlyBase:Links_to_and_from_FlyBase#Database_information_file).

Database information file

The database information file contains the DbName that it corresponds to, the base URL to use for linkout hyperlinks, the homepage URL for your site and a brief description of your database. The filename of this file must use the form

<DBNAME>_dbinfo.txt

Replace <DBNAME> with the value used in column 2 of the link table file that this file corresponds to. The format of this file uses a simple FIELD VALUE format, with one field per line. The field names are as follows

Line 1 - DBNAME
The DBNAME value used in column 2 of the link table.

Line 2 - BASEURL
The base URL to use when constructing links to your database.

Line 3 - HOMEURL
The homepage URL that represents the front page of your database.

Line 4 - DESC
A brief description of your database.

Line 5 - EMAIL
The email to use should we need to contact you.
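The column rules above lend themselves to a quick pre-submission check. The following is a minimal sketch, not FlyBase's own validator. The function name `validate_link_row` is hypothetical, the ID pattern is applied to the first field rather than to the raw line with its trailing tab, and underscores are accepted in DBNAME on the assumption that the "FLYLAB_EX1" example is valid even though the text says alpha-numeric only.

```python
import re

# FlyBase ID pattern from the column description, applied to the field itself.
FBID_RE = re.compile(r"FB\w\w\d+")
# Assumption: underscores are allowed because the spec's own example uses "FLYLAB_EX1".
DBNAME_RE = re.compile(r"[A-Za-z0-9_]+")


def validate_link_row(line):
    """Return a list of problems found in one link-table line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return [f"expected 4 tab-separated columns, found {len(fields)}"]
    fbid, dbname, dbid, _dburl = fields
    problems = []
    if not FBID_RE.fullmatch(fbid):
        problems.append(f"invalid FlyBase ID: {fbid!r}")
    if not DBNAME_RE.fullmatch(dbname):
        problems.append(f"invalid DBNAME: {dbname!r}")
    if len(dbid) > 255:
        problems.append("DBID exceeds 255 characters")
    return problems
```

Splitting on tabs first means a tab inside DBID shows up as a column-count error, which matches the rule that tabs are not allowed in that field. This check only covers the format; whether an ID is current still has to be confirmed with the ID Converter tool.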
File examples

Example 1

GenBank_dbinfo.txt

DBNAME GENBANK
BASEURL http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=
HOMEURL http://www.ncbi.nlm.nih.gov/
DESC A genetic sequence database.
EMAIL johndoe@nowhere.com

GenBank_linkout.txt

#Flybase ID DBNAME DBID DBURL
FBgn0259750 GENBANK AAA86639 AAA86639
FBgn0005561 GENBANK AAB70249 AAB70249

Example 2

UniProt_dbinfo.txt

DBNAME UNIPROT
BASEURL http://www.uniprot.org/
HOMEURL http://www.uniprot.org/
DESC A database of protein sequence and functional information.
EMAIL johndoe@nowhere.com

UniProt_linkout.txt

#Flybase ID DBNAME DBID DBURL
FBgn0259750 UNIPROT O16117 entry/O16117
FBgn0005561 UNIPROT O16804 entry/O16804
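To make the relationship between the two files concrete, here is a small sketch of how the final hyperlink is presumably formed: the DBURL from the link table is appended to the BASEURL from the database information file. The helper names `parse_dbinfo` and `build_link` are hypothetical, and the parser assumes the field name and its value are separated by whitespace (a tab or spaces).

```python
def parse_dbinfo(text):
    """Parse 'FIELD value' lines into a dict, keeping spaces inside the value."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split once on the first run of whitespace: field name, then the rest.
        parts = line.split(None, 1)
        info[parts[0]] = parts[1] if len(parts) > 1 else ""
    return info


def build_link(info, dburl):
    """Append the relative DBURL to the BASEURL, as the column 4 description states."""
    return info["BASEURL"] + dburl
```

With the UniProt example above, `build_link(parse_dbinfo(text), "entry/O16117")` gives `http://www.uniprot.org/entry/O16117`. With the GenBank example, the accession is appended directly after `val=`.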

rykahsay commented 4 months ago

Please tell me what exactly the columns should be. What I have for now is (placeholder from ncbi_linkouts):

$ head  unreviewed/fruitfly_protein_flybase_linkouts.csv 
ProviderId,Database,UID,URL,IconUrl,UrlName,SubjectType,Attribute
10227,Protein,FBgn0263772,https://glygen.org/protein/M9PJ12,,,,
10227,Protein,FBgn0000635,https://glygen.org/protein/P34082,,,,
10227,Protein,FBgn0011638,https://glygen.org/protein/P40796,,,,
10227,Protein,FBgn0013726,https://glygen.org/protein/P40797,,,,
10227,Protein,FBgn0004389,https://glygen.org/protein/P40794,,,,
10227,Protein,FBgn0000719,https://glygen.org/protein/P40795,,,,
10227,Protein,FBgn0010333,https://glygen.org/protein/P40792,,,,
10227,Protein,FBgn0010341,https://glygen.org/protein/P40793,,,,
10227,Protein,FBgn0011656,https://glygen.org/protein/P40791,,,,

jeet-vora commented 3 months ago

@rykahsay

The output file should not be formatted as shown above. Please see the instructions below to create the output file.

Instructions

Output file name

protein_glygen_flybase_linkout.tsv (format is tsv)

Input files

fruitfly_protein_xref_flybase.csv

Output file example

Please ensure the headers and case of the headers are in the same format as shown below. There are four headers.

FlyBase ID DBNAME DBID DBURL
FBgn0032219 GlyGen Q9VKZ5-1 Q9VKZ5-1
FBgn0053303 GlyGen Q76NQ0-1 Q76NQ0-1
FBgn0041723 GlyGen Q76NQ1-1 Q76NQ1-1

Output file name: protein_glygen_flybase_linkout.tsv

The base URL information for protein details page will be relayed in a different file.

@katewarner Add this protein_glygen_flybase_linkout.tsv into the masterlist as a TSV. It will not be used for API. Also create a BCO by adding relevant info from the ticket to the usability domain.

I have also uploaded the glygen_dbinfo.tsv into SP that will be shared with FlyBase after Robel creates the output file.
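As a rough illustration of the instructions above, the conversion from the xref CSV to the four-column TSV could look like the sketch below. The layout of fruitfly_protein_xref_flybase.csv is not given in this ticket, so the column names `uniprotkb_canonical_ac` and `xref_id` are assumptions, and `write_linkout` is a hypothetical helper rather than the actual GlyGen pipeline code.

```python
import csv


def write_linkout(csv_path, tsv_path,
                  ac_col="uniprotkb_canonical_ac",  # assumed input column name
                  fbid_col="xref_id",               # assumed input column name
                  dbname="GlyGen"):
    """Write the FlyBase ID / DBNAME / DBID / DBURL table and return the row count."""
    count = 0
    with open(csv_path, newline="") as src, open(tsv_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(["FlyBase ID", "DBNAME", "DBID", "DBURL"])
        for row in reader:
            accession = row[ac_col]
            # DBID and DBURL both carry the accession; the base URL lives in the dbinfo file.
            writer.writerow([row[fbid_col], dbname, accession, accession])
            count += 1
    return count
```

A call such as `write_linkout("fruitfly_protein_xref_flybase.csv", "protein_glygen_flybase_linkout.tsv")` would then produce rows in the same shape as the example, provided the assumed column names match the real input file.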

rykahsay commented 3 months ago

Check now

$ head unreviewed/protein_glygen_flybase_linkout.tsv
FlyBase ID  DBNAME  DBID    DBURL
FBgn0263772 GlyGen  M9PJ12-1    M9PJ12-1
FBgn0000635 GlyGen  P34082-1    P34082-1
FBgn0011638 GlyGen  P40796-1    P40796-1
FBgn0013726 GlyGen  P40797-1    P40797-1
FBgn0004389 GlyGen  P40794-1    P40794-1
FBgn0000719 GlyGen  P40795-1    P40795-1
FBgn0010333 GlyGen  P40792-1    P40792-1
FBgn0010341 GlyGen  P40793-1    P40793-1
FBgn0011656 GlyGen  P40791-1    P40791-1

katewarner commented 3 months ago

@rykahsay @jeet-vora

I've added the dataset to the masterlist file. I also checked the generated dataset, and the FlyBase IDs appear to be mapped to the correct GlyGen-UniProt IDs.