terraref / reference-data

Coordination of Data Products and Standards for TERRA reference data
https://terraref.org
BSD 3-Clause "New" or "Revised" License
9 stars 2 forks source link

Upload 2017 MAC sorghum season 4 field measurements to BETYdb #174

Closed NewcombMaria closed 5 years ago

NewcombMaria commented 7 years ago

@dlebauer and @gsrohde I'm ready to start uploading some of the field measurements from the 2017 MAC sorghum season (April - September). The cultivar-site table needs a couple revisions to the cultivar identities. As the plants approached their height potential we realized that 2 accessions (cultivars) were incorrectly identified, based on heights from Clemson and in GRIN. We don't know the correct identities. The appropriate cultivar name probably is 'Unknown' or something similar. The following 4 sites need to be re-assigned to 'Unknown' (or 'Unknown accession') instead of PI329440 and PI330807.

PI329440 MAC Field Scanner Season 4 Range 9 Column 14 PI329440 MAC Field Scanner Season 4 Range 32 Column 10 PI330807 MAC Field Scanner Season 4 Range 3 Column 5 PI330807 MAC Field Scanner Season 4 Range 44 Column 2

This issues overlaps with #126 . It's much easier to start with correct cultivar-site descriptions instead of having 2 sets of spreadsheets with and without revisions. Let me know if you need more info to make the changes.

NewcombMaria commented 7 years ago

@dlebauer I need to revise my previous comment. Please change BAP cultivar PI329440 to "Unknown accession' in sites MAC Field Scanner Season 4 Range 9 Column 14 and Field Scanner Season 4 Range 32 Column 10 as requested above, but PI330807 might not be incorrect. It might only be a mixture of different gentoypes with some plants that are 'off-types'. For now it should remain as it is in the cultivar-sites table.

dlebauer commented 7 years ago

@gsrohde can you please update these records in the database? The correct cultivar name is 'Unknown off type'

NewcombMaria commented 7 years ago

Yes, thanks, @gsrohde the only accession that needs to be changed is PI329440, both replicates: PI329440 MAC Field Scanner Season 4 Range 9 Column 14 PI329440 MAC Field Scanner Season 4 Range 32 Column 10 Updated to 'Unknown off type'.

NewcombMaria commented 7 years ago

@dlebauer and @gsrohde the 2017 season 4 experimental design has height blocks (tall, medium, short). It's a row-column design with further constraint that accessions (cultivars) are blocked by height classes. How should the height classes be handled in BETYdb? Is it part of the site/cultivar information? The height blocks extend across ranges (4 ranges are short, 10 are medium, and the rest are tall).

dlebauer commented 7 years ago

These can be stored as distinct experiments. If you can give a list of ranges within each block, I can add these

NewcombMaria commented 7 years ago

Thanks @dlebauer . Height block info is in the tab FINAL_DESIGN column C in this spreadsheet from issue #126 https://docs.google.com/spreadsheets/d/1iUTkyU5P3IfohQg-sLq_ETWvaWBHF5Nvchx3iI8C8Qw/edit#gid=1908904980 Or by range number: Short: ranges 11, 20, 46, 50 Medium: ranges 10, 12, 18, 24, 27, 29, 31, 33, 38, 51 All other ranges are either tall or border plots.

NewcombMaria commented 7 years ago

@gsrohde and @dlebauer I have several 2017 field measurement data sets ready to upload. As soon as the site/cultivar/height-block records are revised I'll start uploading. Has there been any changes to the ability to link Methods to data-sets? Do I still keep track of which Methods go with which data sets, or can it now be linked during the BETYdb upload?

gsrohde commented 7 years ago

@NewcombMaria, @dlebauer

OK, I ran this on bety6's bety database:

UPDATE sites_cultivars
   SET cultivar_id = (SELECT id FROM cultivars 
                                    WHERE name = 'Unknown off type')
   WHERE site_id IN (SELECT id FROM sites
                        WHERE sitename ~ 
                        '^MAC Field Scanner Season 4 Range (9 Column 14|32 Column 10)( (E|W))?');

Checking:

SELECT
    s.sitename AS "site name",
    c.name AS "cultivar name"
FROM
    sites s
            JOIN sites_cultivars sc ON sc.site_id = s.id
            JOIN cultivars c ON c.id = sc.cultivar_id
WHERE
    c.name = 'Unknown off type';

results in

                    site name                    |  cultivar name   
-------------------------------------------------+------------------
 MAC Field Scanner Season 4 Range 9 Column 14    | Unknown off type
 MAC Field Scanner Season 4 Range 32 Column 10   | Unknown off type
 MAC Field Scanner Season 4 Range 9 Column 14 E  | Unknown off type
 MAC Field Scanner Season 4 Range 9 Column 14 W  | Unknown off type
 MAC Field Scanner Season 4 Range 32 Column 10 E | Unknown off type
 MAC Field Scanner Season 4 Range 32 Column 10 W | Unknown off type
(6 rows)
dlebauer commented 7 years ago

πŸ‘ πŸ’― πŸ‘ '^MAC Field Scanner Season 4 Range (9 Column 14|32 Column 10)( (E|W))?')

dlebauer commented 7 years ago

@NewcombMaria you should be ready to go !(?)

I added four records to the experiments table that group the border, small, medium, and large plots https://gist.github.com/dlebauer/f083056af7a835c594a9167373c37773

Then associated them with the appropriate plots / ranges by creating the insert statements in the experiments_sites table of the spreadsheet: https://docs.google.com/spreadsheets/d/1iUTkyU5P3IfohQg-sLq_ETWvaWBHF5Nvchx3iI8C8Qw/edit#gid=552329304

NewcombMaria commented 7 years ago

@gsrohde and @dlebauer thanks very much!! I'll start uploading some of the 2017 data sets later today.

NewcombMaria commented 7 years ago

@gsrohde and @dlebauer turns out I need your help already. I'm trying to upload a spreadsheet for one part of height data. Two questions: Question 1) I made a new 2017 citation, is that the wrong way to use citations? I'm getting the error message: Data Value Errors

I think the site names are correct. The spreadsheet is in this 2017 upload google folder https://drive.google.com/drive/folders/0B1OR5-zeXagnMlpDQjBpRDlYU3M

Question 2) The 'experiment' column with Tall, Medium, and Short blocks is being ignored. Is that ok since the experiments are already linked to sites in the table? Or do I need to do something different?

dlebauer commented 7 years ago

1) I think that the requirement that sites and citations be linked is a bug - this could be done automatically with the bulk upload. Since this is pre-publication I don't think that at this point there is a need to have separate citations for 2016 and 2017 data. The objective for using the citations table for unpublished data is to track who collected the original data, who can be contacted with questions, and who should be contacted when they are used.

2) the experiments are associated with the sites already, so these don't need to be uploaded.

NewcombMaria commented 7 years ago

@dlebauer apparently the new 2017 citation wasn't the problem. I tried the bulk upload for the same spreadsheet but this time with the original 2016 citation and still get the same error message: Data Value Errors

Is there a step that needs to be done using the API to link the new sites (season 4 sites) to existing citations before I can use the bulk upload web interface?

dlebauer commented 7 years ago

@NewcombMaria i've updated the records (and also created an issue to fix this error in pecanproject/bety#539). Please retry and let me know if this works.

NewcombMaria commented 7 years ago

Thanks @dlebauer . The site-citation is fixed and I'm able to make it past the first steps up to 'Confirm data' but get stopped by the message the treatment 'control' is not found in the treatments table.
Not found in treatments table Specify Data-Set Wide Values treatment control

Should I be specifying a different treatment? Or can 'control' be added to the treatment table? I think we are getting close... Thanks for your help.

dlebauer commented 7 years ago

I just added a treatment called 'Control BAP 2017' that should work: ' https://terraref.ncsa.illinois.edu/bety/treatments/6000000023

Have you already added all of the managements (planting date, fertilization, etc)?

On Tue, Sep 26, 2017 at 11:44 AM MariaNewcomb notifications@github.com wrote:

Thanks @dlebauer https://github.com/dlebauer . The site-citation is fixed and I'm able to make it past the first steps up to 'Confirm data' but get stopped by the message the treatment 'control' is not found in the treatments table.

Not found in treatments table Specify Data-Set Wide Values treatment control

Should I be specifying a different treatment? Or can 'control' be added to the treatment table? I think we are getting close... Thanks for your help.

β€” You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/terraref/reference-data/issues/174#issuecomment-332261247, or mute the thread https://github.com/notifications/unsubscribe-auth/AAcX5wOvxtlBRH7BFimNaCISKBlEz7oDks5smSnvgaJpZM4PNgMB .

NewcombMaria commented 7 years ago

@dlebauer I tried again with 'Control BAP 2017' but get the same return 'Not found in treatments table'. Anything starting with Control gives the pop-down 'No matches'.

I've added some of the management activities but not all. I'll try to complete the management additions soon.

dlebauer commented 7 years ago

How about the name "BAP 2017"? I just added this one - perhaps try refreshing the browser?

NewcombMaria commented 7 years ago

I logged out and back in and tried again with both 'BAP 2017' and 'Control BAP 2017' and also 'Control' and still get 'Not found in treatments table. @dlebauer any other suggestions? Thanks for your help.

NewcombMaria commented 7 years ago

If I add a new treatment myself under the 2017 citation called 'Control BAP 2017' with Control=True would that work or would that make problems relating the sites to other data like sensor-derived data?

dlebauer commented 7 years ago

At this point the name doesn't matter - I'd be curious if it makes a difference if you make the treatment, but this is the one that I made: https://terraref.ncsa.illinois.edu/bety/treatments/6000000023 and you can feel free to rename it if that helps.

NewcombMaria commented 7 years ago

I successfully made it past the treatment designation and on to the next error during the Insert data step: PG::RaiseException: ERROR: The value of cultivar_id (6000000993) is not consistent with the value 6000000212 specified for site_id 6000005356. : INSERT INTO "traits" ("access_level", "checked", "citation_id", "created_at", "cultivar_id", "date", "date_day", "date_month", "date_year", "dateloc", "entity_id", "mean", "method_id", "n", "notes", "site_id", "specie_id", "stat", "statname", "time", "time_hour", "time_minute", "timeloc", "treatment_id", "updated_at", "user_id", "variable_id") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27) RETURNING "id"

Is it only one cultivar 6000000993 that is not matched with the correct site? Which cultivar and which site?

To get beyond the treatment specification I created 'Control BAP 2017'. It seems to make a difference that I'm the Record Creator: https://terraref.ncsa.illinois.edu/bety/treatments/6000000024

dlebauer commented 7 years ago

What is the cultivar name that you have in the site "MAC Field Scanner Season 4 Range 27 Column 2"?

dlebauer commented 7 years ago

PS for each of these you can identify which cultivar and site is being referred to by appending the id to the url with the table name, e.g.

NewcombMaria commented 7 years ago

The spreadsheet has 'MAC Field Scanner Season 4 Range 27 Column 2' site with cultivar 'PI641807'.
The other replicate of PI641807 is at 'MAC Field Scanner Season 4 Range 10 Column 2'. The common name for PI641807 is ATLAS, which is confusing and has been discussed in a couple different issues (see for example comment May 26 in https://github.com/terraref/reference-data/issues/126). PI641807 is correct for 2017 season 4. ATLAS also came up in one of the KSU github issues. ATLAS may have a different PI identity in KSU fields.

NewcombMaria commented 7 years ago

The other github issue that discusses ATLAS is #65 , see comments Aug 11 to 14. It may have been merged with an incorrect PI number (PI586537).

dlebauer commented 7 years ago

I've updated it. Not sure where the error came from, but we do have the following records for "ATLAS":

id name ecotype notes
6000000212 PI586537 Sweet Sorghum Commercial Sweet Sorghum Hybrid. Common name ATLAS
6000000993 PI641807 BAP Commercial Sweet Sorghum Hybrid. Common name ATLAS

From that comment, https://github.com/terraref/reference-data/issues/126#issuecomment-304304105 it sounds like these should be kept separate since they are from different seed lots.

NewcombMaria commented 7 years ago

The file was successfully uploaded! Thanks very much. It looks like we're on track now! Next as soon as I have a chance I'll try uploading a file with subplots.

NewcombMaria commented 7 years ago

Sorry, apparently we're not quite on track yet. Uploading the second part of the same data set gives the error: PG::RaiseException: ERROR: The value of cultivar_id (6000000993) is not consistent with the value 6000000212 specified for site_id 6000005672. : INSERT INTO "traits" ("access_level", "checked", "citation_id", "created_at", "cultivar_id", "date", "date_day", "date_month", "date_year", "dateloc", "entity_id", "mean", "method_id", "n", "notes", "site_id", "specie_id", "stat", "statname", "time", "time_hour", "time_minute", "timeloc", "treatment_id", "updated_at", "user_id", "variable_id") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27) RETURNING "id"

I tried searching on site 6000005672 but get ' no entries found'

dlebauer commented 7 years ago

Can you link to the file you are trying to upload?

On Tue, Sep 26, 2017 at 5:03 PM MariaNewcomb

< mailto:MariaNewcomb notifications@github.com

wrote:

a, pre, code, a:link, body { word-wrap: break-word !important; }

Sorry, apparently we're not quite on track yet. Uploading the second part of the same data set gives the error:

PG::RaiseException: ERROR: The value of cultivar_id (6000000993) is not consistent with the value 6000000212 specified for site_id 6000005672. : INSERT INTO "traits" ("access_level", "checked", "citation_id", "created_at", "cultivar_id", "date", "date_day", "date_month", "date_year", "dateloc", "entity_id", "mean", "method_id", "n", "notes", "site_id", "specie_id", "stat", "statname", "time", "time_hour", "time_minute", "timeloc", "treatment_id", "updated_at", "user_id", "variable_id") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27) RETURNING "id"

I tried searching on site 6000005672 but get ' no entries found'

β€”

You are receiving this because you were mentioned.

Reply to this email directly, https://github.com/terraref/reference-data/issues/174#issuecomment-332349588 , or https://github.com/notifications/unsubscribe-auth/AAcX55YsgkcKas24uyx_x-3pyuWDoXRoks5smXSxgaJpZM4PNgMB .

NewcombMaria commented 7 years ago

There are 2 files at https://drive.google.com/drive/folders/0B1OR5-zeXagnMlpDQjBpRDlYU3M The one ending in _a was uploaded successfully. The one ending in _b gave the error message.

NewcombMaria commented 7 years ago

It looks like we have PI641807 in 4 sites in the field. All other cultivars should only be at 2 sites (2 reps). Looking back at the original entry list we received from Clemson, we received seed identified as ATLAS and seed identified as PI641807. We planted 2 reps of each and then later were told that the seed source of ATLAS sent to us is also identified PI641807.

NewcombMaria commented 7 years ago

@dlebauer or @gsrohde if you have any time today to update the problem records then I'll upload field measurement data this evening

gsrohde commented 7 years ago

@NewcombMaria What exactly is the error you get when attempting the _b file upload? Does it happen on the validation page or right when you click the "Insert Data" button? Which citation where you using for the upload?

NewcombMaria commented 7 years ago

I get the error copied below, which is very similar to an error I originally had with the file ending in _a except different site_id. I think we tracked and fixed the problem that caused the error with the file ending in _a by changing 2 of the sites to PI641807. There may be 2 additional sites that need to be similarly changed. Let me know if this needs more explanation. Thanks!

PG::RaiseException: ERROR: The value of cultivar_id (6000000993) is not consistent with the value 6000000212 specified for site_id 6000005672. : INSERT INTO "traits" ("access_level", "checked", "citation_id", "created_at", "cultivar_id", "date", "date_day", "date_month", "date_year", "dateloc", "entity_id", "mean", "method_id", "n", "notes", "site_id", "specie_id", "stat", "statname", "time", "time_hour", "time_minute", "timeloc", "treatment_id", "updated_at", "user_id", "variable_id") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27) RETURNING "id"

NewcombMaria commented 7 years ago

@gsrohde the citation I'm using:
Newcomb, Maria | 2017 | Maricopa Agricultural Center Field Measurements and Activities

I get the error at the Insert Data step

gsrohde commented 7 years ago

This error occurs because site "MAC Field Scanner Season 4 Range 31 Column 13" (6000005672) is associated with cultivar "PI586537" (6000000212), but your upload file is trying to assign cultivar "PI641807" (6000000993) to a trait at that site.

So if the cultivar growing at site "MAC Field Scanner Season 4 Range 31 Column 13" really is "PI641807", then I'll correct the sites_cultivars table.

NewcombMaria commented 7 years ago

Thanks @gsrohde . Yes the correct cultivar for 4 sites total is PI641807. Yesterday 2 of the sites were corrected which allowed the file ending in _a to successfully upload. 'MAC Field Scanner Season 4 Range 31 Column 13' and also 'MAC Field Scanner Season 4 Range 29 Column 8' should both be associated with cultivar "PI641807" (6000000993). Thanks very much!

NewcombMaria commented 7 years ago

@gsrohde or @dlebauer I tried again to upload the file Sorghum_season4_Heights_Fullfield_plots_b and am stopped with the same error: PG::RaiseException: ERROR: The value of cultivar_id (6000000993) is not consistent with the value 6000000212 specified for site_id 6000005672.

Have the records in the sites_cultivars table been revised? The correct association should be cultivar PI641807 cultivar_id (6000000993) at the following sites: MAC Field Scanner Season 4 Range 27 Column 2 MAC Field Scanner Season 4 Range 27 Column 2 W
MAC Field Scanner Season 4 Range 27 Column 2 E
MAC Field Scanner Season 4 Range 31 Column 13 MAC Field Scanner Season 4 Range 31 Column 13 W
MAC Field Scanner Season 4 Range 31 Column 13 E

Let me know if/when the table has been updated and I'll try again.

dlebauer commented 7 years ago

Only "MAC Field Scanner Season 4 Range 31 Column 13" needed to be updated

select cultivar_id, sitename, site_id from sites_cultivars join sites on sites_cultivars.site_id = sites.id where sitename like 'MAC Field Scanner Season 4 Range 27 Column 2%' or sitename like 'MAC Field Scanner Season 4 Range 31 Column 13%'

update sites_cultivars set cultivar_id = 6000000993 where cultivar_id = 6000000212 and site_id = 6000005672

@NewcombMaria The file Sorghum season4_Heights_FullField_plots_b (1).csv is ready to be uploaded. It contains all of the data except a single row (see below).


@gsrohde please check why in Sorghum season4_Heights_FullField_plots_b (1).csv throws an error. The cultivars - sites relationship appears valid

select cultivar_id, sitename, name, site_id from sites_cultivars join sites on sites_cultivars.site_id = sites.id join cultivars on sites_cultivars.cultivar_id = cultivars.id where sitename like 'MAC Field Scanner Season 4 Range 42 Column 10'

returns

6000000806 MAC Field Scanner Season 4 Range 42 Column 10 PI329665 6000005886

NewcombMaria commented 7 years ago

I'll try uploading now. If all of the sites including plot and subplots (E/W) then the error might be that I had one cell with the text 'no data'. I'll let you know if the upload works. Thanks.

NewcombMaria commented 7 years ago

@dlebauer and @gsrohde I'm happy to report success with uploading Sorghum_season4_Heights_Fullfield_plots_b . Thanks for revising the sites-cultivar table! Next I'll upload height data sets that have sites designated at the subplot level (E/W rows)

dlebauer commented 7 years ago

@NewcombMaria per our conversation I deleted all of the records that you uploaded today (panicle_height and canopy_height)

delete from traits where user_id = 6000000017 and traits.created_at > CURRENT_DATE;
NewcombMaria commented 7 years ago

@dlebauer many thanks! I'll re-upload the data with the correct use of site/entity.

NewcombMaria commented 7 years ago

All the height data are now uploaded, along with stand count data from manual counts early in the season after plot thinning. I'm keeping a list of csv files successfully uploaded with date/time in a document in the directory https://drive.google.com/drive/folders/0B1OR5-zeXagnMlpDQjBpRDlYU3M

gsrohde commented 7 years ago

@dlebauer Be sure, when you use delete statements such as the above, that you keep in mind that traits.created_at > CURRENT_DATE grabs all traits created today UTC. That will include traits created yesterday after 5PM Mountain Standard Time.

dlebauer commented 7 years ago

Thanks for the reminder Scott. I did check using select that I only got the records that Maria had uploaded.

NewcombMaria commented 7 years ago

Seed drop during planting (all plots) and seedling emergence counts and emergence rates over multiple dates from april 26 to may 9 (subset of plots) are now uploaded

NewcombMaria commented 7 years ago

I added field management activities for season4. Since there is not an option for selecting 'other' under 'Management Type' I was limited to options in the drop-down list. For plant thinning completed May 10, 11, and 12 the best fit for 'Management Type' seemed to be 'soil disturbance'.