Fitting data to Chado schema

btski commented 6 years ago

At @spficklin's suggestion, I'd like to ask the Tripal community for help/feedback on my data assignments to the Chado schema, as shown in the following PowerPoint sheet: Chado_load.pptx

A summary of my data fields: unique ID, source accession, strain/isolate name, strain/isolate source, genotype, author/s, collection location (country/state), collection date (day/month/year), DNA sequence, AA sequence/s, AA sequence name (open reading frame/non-structural protein), AA location.

Categories marked as '!null' cannot be left empty. I tried to piece things together as best I could using the general schema structure (http://gmod.org/wiki/Chado_Tables). I ended up using the cvterm table as a catchall for data that lacked a distinct match, since it allows for mixeded characters (flexible) and is unique to cv terms (a required field). Please let me know your thoughts and if there are any ways to improve my data categorization. Thanks!

P.S. I've attempted to contact the Chado mailing list for assistance but did not receive a response. That was several weeks ago, so it's probably about time I followed up with them.

bradfordcondon commented 6 years ago

Happy to give my opinion. Looking at your slide, am i correct that your main boxes are structure as Proposed chado table (current data field/table)

Apologies but I find that slide confusing to look at (hopefully it was helpful for you to make it!) so instead, I might direct you this way.

You want to load into several tables, namely:

organism
biomaterial
feature
genotype
contact

If your fields are a single table, you are clearly going to have to split it up as you've noticed. If your starting point is all those fields are tied to a single accession in a database, then start with the dbxref table. You'll see the full list of tables that can point to this dbxref table via a foreign key.

So going over your data fields:

source accession: this remains accession in the dbxref table.
strain/isolate name (probably biomaterial or stock. If biomaterial, it would go in biomaterial.name and biomaterial.description) Both of those tables can point to dbxref
genotype - this would go in the genotype table. you can then link to the foreign key of a feature or stock.
author/s - One of two places. if published, put the publication in pub (which can link to the accession via pub_dbxref), and the authors in pubauthor. if not published, use the contact table.
collection location (country/state) if biomaterial, this goes in biomaterial prop. i dont use stock, but stock might have a fancier way to deal with it?
collection date (day/month/year) same as above
DNA sequence. This would go in as a feature, with the sequence in the residues column
AA sequence/s. same as above. you can have the two features linked as child/parent
AA sequence name feature.name
AA location. Depending how involved you want to go, this an be featurepos (feature position) or featurerange. Alternatively, you could add it as a featureprop.

If you need to assign a property to something that isn't a column in the table, you're going to assign it via the prop table that corresponds to it. It looks like you understand this as you're putting collection location in organismprop for organism. except i would suggest putting the location in biomaterial or stock instead, because organism is for the taxonomy (ie all organisms of that species, everywhere in the world)

edit: I also notice sometimes you say cv (country) cv (state). This is definitely incorrect. The cvterm would be day or year, and the CV would refer to a controlled vocabulary that has that term (or local, if you dont want to link it to a controlled vocabulary and want to just use your own language). You can use the EBI Ontology lookup service to find ontologies for your terms (see this one for year which would be DB: SIO, CV: SIO, CVTERM: year, ACCESSION: 000428. If you want to learn more about the CV/DB issue I have written a guide to aid developers.

btski commented 6 years ago

I appreciate the feedback @bradfordcondon and thank you for correcting my misconception with using the cvterm table. I'll refer to your written guide as well and make adjustments accordingly.

btski commented 6 years ago

Some follow up questions:

I'm still a little confused with the cvterm and cv table. The cv table requires selection of a predefined ontology. The cvterm table is a term associated with the cv that the user creates. If I'm understanding everything correctly, a biomaterialprop table with year data could have a biomaterial.value of "2011" and then link to a cvterm table with cvterm.name "input_year", linked to a cv table with cv.name "year". Is this correct?

I'd like to associate a second accession number to my data (the GenBank accession, if available). I'm planning to use the dbxrefprop table with a cvterm akin to accession number. Would this be the best way to accomplish this objective?

If a linking table requires input from a cvterm, how do you make the assignment? For example, the feature_genotype table requires a non NULL cvterm input, which shouldn't be necessary.

How do you list a feature location that is discontinuous (frameshift during translation) in the featureloc table if the only relevant inputs are fmin and fmax? Do you create two separate entries, or just have the feature table with the translated AA sequence be the sole reflection of the frameshift?

bradfordcondon commented 6 years ago

CVs and CVterm: Take a look at the Chado CV module for some background help.

f I'm understanding everything correctly, a biomaterialprop table with year data could have a biomaterial.value of "2011" and then link to a cvterm table with cvterm.name "input_year", linked to a cv table with cv.name "year". Is this correct?

Close, but no. The biomaterialprop.value is 2011. The biomaterialprop links to the biomaterial via the biomaterialprop.biomaterial_id column, and to the CVterm via biomaterialprop.type_id (which is the cvterm id). cvterm.name is year. Not sure where you are getting input_year from? The term you linked comes from the SIO(semantiscience Integrated Ontology). So SIO info goes in the CV table.

your cvterm columns:

cvterm_id : n
cv_id : foreign key to the SIO entry
definition: "A year is a period of time taken by a planet to make one revolution around the sun."
dbxref_id: this points to the link made between DB and CV. If you use the tripal OBO or cvterm importer, you wont need to set/understand this. The short story is that the relevant columns here will be the foreign key to the SIO DB entry for db_id and accession will be 000428. You can refer to my guide for more in the db/cv thing but dont get hung up on it IMO.

So your CV for SIO looks like this:

cv_id : n
name: SIO
definition: SemanticScience Integrated Ontology.

I'd like to associate a second accession number to my data (the GenBank accession, if available). I'm planning to use the dbxrefprop table with a cvterm akin to accession number. Would this be the best way to accomplish this objective?

Again, not quite. Everything you want for just associating something with an accession is handled via the dbxref table and the appropriate linker table. So the DBXref is going to hold the DB and Accession on the corresponding columns. DB= GenBank, Accession = the accession number. Then there are linker tables to link a content type to that accession number. For example, biomaterial_dbxref or feature_dbxref. These just take the foreign keys of the dbxref and what you are linking.
DBxrefprop is to attach additional metadata to the dbxref! For example: If your dbxref has a year attached, you could assign the year by adding this DBXREF prop (also see the link):

dbxref_id : Points to dbxref you are assigning the prop to, in this case, the year
type_id: points to the cvterm id in the cvterm table. In this case, the term for year.
value: the year (2011)

If a linking table requires input from a cvterm, how do you make the assignment? For example, the feature_genotype table requires a non NULL cvterm input, which shouldn't be necessary.

I haven't messed with this ever so im not sure, and im not finding the wiki too insightful.

How do you list a feature location that is discontinuous (frameshift during translation) in the featureloc table if the only relevant inputs are fmin and fmax? Do you create two separate entries, or just have the feature table with the translated AA sequence be the sole reflection of the frameshift?

Again havent messed with this, but reading the table description:

Note on multiple locations: Each feature can have 0 or more locations. Multiple locations do NOT indicate non-contiguous locations (if a feature such as a transcript has a non-contiguous location, then the subfeatures such as exons should always be manifested).

I'm actually confused about your use case though. Your parent feature is a gene and your child feature are two alternate transcripts? Or your parent feature is the gene and your child feature is the frameshift mutation?

btski commented 6 years ago

Thanks for that detailed explanation @bradfordcondon.

I was under the impression that cvterm and cv were basically redundant information where cvterm was human assigned and cv was linked to an established ontology. It makes more sense that cv describes the source of the ontology and cvterm describes the ontology details.

I felt that the secondary accession would work best within the dbxref table, but I was unsure as to how everything would fit together. If I can link multiple foreign keys at a time it will work well. Otherwise I'll have to link everything twice over for data with two distinct accessions (not a big deal now that I think about how SQL exclusively stores unique data without duplicates).

I'm currently working with PRRSV, a small, single-stranded RNA virus. For my use case, I was going to have the RNA sequence/genome as one feature and translated AA sequences (from open reading frames and non-structural proteins) as secondary features. I am going to include feature locations for all translated AA sequences, but some of the AA sequences are discontinuous due to frameshifts. The genome experiences high mutation rates and translational frameshifts occur, so I'm looking for a way to express where frameshifts are located. The featureloc table doesn't seem to accommodate for frameshifts well. From what I can tell, the only way I can represent frameshifts is indirectly, by allowing people to infer a frameshift based on differences between the coding RNA sequence and the translated AA sequence.

bradfordcondon commented 6 years ago

OK great so you've got 2 sets of features, the transcript and the derived polypeptide. Our site does the same thing.

Would it work to include the frameshift info as featureprops? We do this with SSRs: for example, https://hardwoodgenomics.org/feature/WO454_contig12027_v2?pane=predicted_ssr . In this example each featureprop has the key = > value pair of for example SSR start => 23

btski commented 6 years ago

It seems a bit strange to house the start/end points of polypeptides and corresponding frameshift locations in separate tables, but I agree that's probably the best way to categorize the data within the Chado schema. I can always reassemble the locations in a more meaningful way after performing an SQL query. To demonstrate my interpretation of your suggestion with an example: ORF1ab coding range is from 192-7700 and 7700-12073 (single nucleotide is read twice) featureloc.fmin = 192 featureloc.fmax = 12073 featureprop.value = 7700, 7700 Query output = 192+"-"+7700, 7700+"-"+12073 = 192-7700, 7700-12073

Thanks again for the help. By the way, your site looks great. Love the clean/simple design. That's basically how I hope to implement my own Tripal site.

bradfordcondon commented 6 years ago

Apologies for not being in a viral frame of mind.

I'd definitely have the parent feature be the RNA. You can then associate multiple polypeptide sequences with that parent RNA feature. You want to indicate where the frameshift is occurring somehow.

I'm not sure why you cant say 192-12073 is the coding range. You could then add a featureprop where the cvterm = frameshift 2x read (hopefully a less made up cvterm :) ), value = 7700 and associate it with the polypeptide with the frameshift. Or put another way, im not sure what you accomplish by splitting it up and having the featureprop 2x.

Anyway, I'd say double check the feature locations information but otherwise sounds reasonable.

btski commented 6 years ago

I think I should be able to represent each frameshift as a single featureprop value. I just need to find a character to separate the numeric values for parsing into a more legible format on query. Not sure what the character restrictions would be (seems to permit numbers/letters but perhaps not special characters).

Last question: Would it be feasible to consolidate all data into a tab delimited file and load everything with the bulk loader, or would you advise using the FASTA loader for the sequencing data? I'm going to have to restructure my data anyway to get it into either FASTA or tab delimited format.

bradfordcondon commented 6 years ago

Either approach should be possible. I haven't seen your delimited data but I would probably load in the fasta first and the metadata after via the bulk loader. So...

Load all features using the Chado FASTA loader.
Load additional annotations using the bulk loader.

That way you can keep the data you are importing via the bulk loader more forward as TSV/CSV. (which I guess you can do with sequences in there but I hate having sequence data in column format). You just have to concern yourself with the unique name of the parent feature (the RNA genome in your case) and link everything to that.

Stephen has done some thorough demos on how to use the bulk loader which should be on youtube somewhere? Also its well documented in the users guide - tripal 3 link but you can find the equivalent tripal 2 link as well.

btski commented 6 years ago

Thanks for the suggestion. I'll try using the two separate file formats to preserve file readability.

bradfordcondon commented 6 years ago

No problem. Good luck testing your site out, feel free to reach out again here or on Slack

spficklin commented 6 years ago

Thanks @bradfordcondon for your input! I'm going to close this out, but if there is still discussion @btski please reopen.

tripal / tripal

Fitting data to Chado schema #185