terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data
BSD 3-Clause "New" or "Revised" License

Uploading BETY trait CSVs from Google Drive #582

Closed: max-zilla closed 5 years ago

max-zilla commented 5 years ago

@kimberlyh66 during our last NCSA meeting I told @dlebauer I would try to upload to BETY the BETYdb CSVs that @ZongyangLi put on Google Drive, and he suggested asking you for assistance if there were issues with uploading them.

https://drive.google.com/drive/folders/1Y-Qdxe1GgCgXSxR0KFEeyIVyoQyv-tCX

There are 3 directories in this Google Drive folder, each with archives (.tar.gz or .zip) containing daily CSVs for BETY upload:

ARCHIVE               -- TRAIT_COLUMN_NAME(S)
s4_98th_height.tar.gz -- 98th_quantile_canopy_height
s6_98th_height.tar.gz -- 98th_quantile_canopy_height

s4Panicle_BETY.zip    -- panicle_counting, panicle_volumn_median, 
                         panicle_surface_are_median
S6PanicleBETY.tar.gz  -- panicle_counting, panicle_volumn_median, 
                         panicle_surface_are_median

s4_leaf_angle.tar.gz  -- leaf_angle_alpha_src, leaf_angle_beta_src, 
                         leaf_angle_alpha_fit, leaf_angle_beta_fit, 
                         leaf_chi_src, leaf_chi_fit
s6_leaf_angle.tar.gz  -- leaf_angle_alpha_src, leaf_angle_beta_src, 
                         leaf_angle_alpha_fit, leaf_angle_beta_fit, 
                         leaf_chi_src, leaf_chi_fit

I wrote a small Python script to iterate over the daily CSVs and push them to BETY with some key snippets here:

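import requests

BETY_URL = "https://terraref.ncsa.illinois.edu/bety/api/v1/traits"
BETY_KEY = "<SECRET>"

def submit_traits(csv):
    # POST the raw CSV body to the traits.csv endpoint, authenticating with the API key
    with open(csv, 'rb') as f:
        resp = requests.post("%s.%s" % (BETY_URL, 'csv'),
                             params={'key': BETY_KEY},
                             data=f.read(),
                             headers={'Content-type': 'text/csv'})
    # Return the response so the caller can log the status code and message
    return resp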

...however, none of the CSVs were successfully uploaded. A line from my logfile for each file:

/Users/mburnette/Downloads/BETYdbUploads/s4_98th_height/2017-05-06_98th_quantile.csv,No trait variable was found in the CSV file.
/Users/mburnette/Downloads/BETYdbUploads/s6_98th_height/2018-05-18_98th_quantile.csv,No trait variable was found in the CSV file.
/Users/mburnette/Downloads/BETYdbUploads/s4BetyLeafAngle/2017-08-15_betaD.csv,No trait variable was found in the CSV file.
/Users/mburnette/Downloads/BETYdbUploads/s6leafAngleBety/2018-05-19_betaD.csv,No trait variable was found in the CSV file.
/Users/mburnette/Downloads/BETYdbUploads/s4Panicle_BETY/2017-07-20_panicle.csv,No trait variable was found in the CSV file.
/Users/mburnette/Downloads/BETYdbUploads/s6PanicleBETY/2018-06-16_panicle.csv,No trait variable was found in the CSV file.

I'm assuming we need traits defined in BETY that correspond to the column names I listed above, and that they don't exist yet? We've successfully uploaded other BETY data such as CanopyCover with similar CSVs. The "No trait variable..." error message came from BETY with a 400 response on the POST.

Please let me know if you might be able to look into this and how I can help.
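A minimal sketch of the kind of upload loop described above, for anyone reproducing the logfile. It assumes the daily CSVs sit somewhere under one local directory and that any response with status 400 or higher counts as a failure; `post_csv`, `upload_all`, and the `send` parameter are hypothetical names, not the actual script.

```python
import glob
import os

BETY_URL = "https://terraref.ncsa.illinois.edu/bety/api/v1/traits"
BETY_KEY = "<SECRET>"


def post_csv(path):
    """POST one CSV to BETY and return (status_code, response_text)."""
    # Imported here so upload_all() can be exercised with a stand-in sender
    # on machines that do not have requests installed.
    import requests

    with open(path, "rb") as f:
        resp = requests.post(BETY_URL + ".csv",
                             params={"key": BETY_KEY},
                             data=f.read(),
                             headers={"Content-type": "text/csv"})
    return resp.status_code, resp.text


def upload_all(root, log_path, send=post_csv):
    """Send every CSV under root; write 'path,message' for each failure."""
    failures = 0
    # Collect the file list before opening the log so the log is never picked up.
    paths = sorted(glob.glob(os.path.join(root, "**", "*.csv"), recursive=True))
    with open(log_path, "w") as log:
        for path in paths:
            status, text = send(path)
            if status >= 400:  # assumption: 4xx/5xx means the upload was rejected
                failures += 1
                log.write("%s,%s\n" % (path, text.strip()))
    return failures
```

Passing a different `send` function makes it possible to dry-run the loop without touching the BETY server.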

dlebauer commented 5 years ago

@ZongyangLi can you please define the variables and methods associated with these data?

ZongyangLi commented 5 years ago

Added methods via the following link: https://terraref.ncsa.illinois.edu/bety/methods/new

Scanner 3d ply data to 98th quantile height
Scanner 3d ply data to leaf angle distribution
Scanner 3d ply data to panicle counting

Added variables via the following link: https://terraref.ncsa.illinois.edu/bety/variables/new

98th_quantile_canopy_height
leaf_angle_alpha_src
leaf_angle_beta_src
leaf_angle_alpha_fit
leaf_angle_beta_fit
leaf_chi_src
leaf_chi_fit
panicle_counting
panicle_volumn_median
panicle_surface_area_median

Mistake on my side: I accidentally added 98th_quantile_canopy_height as a method; please delete it from the methods.

dlebauer commented 5 years ago

@ZongyangLi thanks for doing this ... should the method 'Scanner 3D ply to 98th quantile height' be associated with the trait 'canopy_height'? More specifically, if using the 98th quantile of the point cloud is intended to reflect the actual canopy height, then do we need a separate variable?

Similarly, if the best estimate of the panicle_volume is the median, then it would make sense to call the trait 'panicle_volume' and describe the method of estimation in the methods (same for surface_area). And I am not sure what the difference is between _src and _fit, but I suspect that these can also be differentiated in the methods rather than in the variable itself.

And to clarify - are you requesting that I delete the 98th_quantile_canopy_height method? I can do that although if you added it you should be able to delete it (as long as there aren’t any data already associated with the method).

abby621 commented 5 years ago

We've proposed a new naming scheme, listed as [Variable Name, Method].

@dlebauer Does this fit your naming convention?

Change 98th_quantile_canopy_height to [ Canopy Height, 3D_scanner_98th_quantile]

Change Leaf angle variables from:
leaf_angle_alpha_src
leaf_angle_beta_src
leaf_chi_src
leaf_angle_alpha_fit
leaf_angle_beta_fit
leaf_chi_fit

to:
[ Leaf Angle Mean, 3D_scanner_leaf_angle_distribution]
[ Leaf Angle Variance, 3D_scanner_leaf_angle_distribution]
[ Leaf Angle Alpha, 3D_scanner_leaf_angle_distribution]
[ Leaf Angle Beta, 3D_scanner_leaf_angle_distribution]
[ Leaf Angle Chi, 3D_scanner_leaf_angle_distribution]

And for panicles change from:
panicle_counting
panicle_volume_median
panicle_surface_area_median

to:
[Panicle Count, 3D_scanner_panicle_count]
[Panicle Volume, 3D_scanner_panicle_volume_median]
[Panicle Surface Area, 3D_scanner_surface_area_median]

Additionally, the leaf length and width parameters would have the following variables and methods:

[leaf_length, 3D_scanner_geodesic_kalman]
[leaf_length, 3D_scanner_geodesic_unfiltered]
[leaf_width, 3D_scanner_geodesic_kalman]
[leaf_width, 3D_scanner_geodesic_unfiltered]

Do those naming conventions for variables and methods seem to be more consistent?

dlebauer commented 5 years ago

Hi Abby - this is definitely on the right track, but I have a few thoughts and it will be easier to flesh this out in this spreadsheet where we can capture the other information like descriptions, units, citations, etc.

A few notes -

Method names

It might make sense to include something about the algorithm used (like where 'kalman' is used) rather than just saying '3D Scanner Panicle Volume', which doesn't allow it to be differentiated from another algorithm.

Variable names

The variable naming convention loosely follows the way CF (Climate and Forecast) standard names are constructed ... you can see examples here. Variable names are thus snake_case. Method names don't have such a constraint, so they can be typed like the title of a protocol.

Statistics

The leaf_angle_mean and leaf_angle_variance present a special case, since BETYdb is designed to store mean values alongside (optionally) the sample size and a statistic. The appropriate name for the mean leaf angle would therefore be leaf_angle, and each of these values can either stand alone or be stored with a statistic. It would still be okay to have leaf_angle_variance alongside leaf_angle_beta etc., but there is also the option of including the columns 'stat', 'statname' and 'n'. For now let's ignore n because that gets confusing. Unfortunately we only store one statistic for each record, or else we could treat alpha and beta in the same way.

Also on the topic of variance. Does the variance you are computing have the same units as the mean? Would it make sense to call this 'Standard Deviation'?

As a footnote, I'll reference this lengthy discussion, where I think we concluded that we would fit the normal and beta distributions separately, such that, e.g., mean != alpha/(alpha+beta); if these values end up being equal then we should consider storing only one or the other set of parameters, or else analyses that include both traits might have numerical issues.
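The redundancy concern in the footnote can be checked numerically. The sketch below relies on the standard result that a Beta(alpha, beta) distribution has mean alpha/(alpha+beta), and it assumes the stored mean is on the same [0, 1] scale as the fitted beta distribution (an angle in degrees would need rescaling first). The function names and the tolerance are illustrative only.

```python
import math


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def is_redundant(mean, alpha, beta, rel_tol=1e-6):
    """True if the separately fitted mean carries no information beyond alpha and beta."""
    return math.isclose(mean, beta_mean(alpha, beta), rel_tol=rel_tol)
```

If `is_redundant` were true for most records, storing both parameter sets would add collinear traits, which is the numerical issue mentioned above.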

abby621 commented 5 years ago

Hi David - I don't currently have permission to edit that google sheet. If you grant it, I can fill things out there, but in the meantime, I'll reply in line here. I've gone through and edited our variables and methods to reflect your comments (snake case for variables, descriptive for methods, adding in algorithm details where appropriate). If you're on board with these changes, then @ZongyangLi can implement them.

Change 98th_quantile_canopy_height to [ canopy_height, 3D scanner to 98th quantile height]

Change Leaf angle variables from:
leaf_angle_alpha_src
leaf_angle_beta_src
leaf_chi_src
leaf_angle_alpha_fit
leaf_angle_beta_fit
leaf_chi_fit

to:

[ leaf_angle_mean (+ leaf_angle_variance stored alongside as a statistic), 3D scanner to leaf angle distribution]
[ leaf_angle_alpha, 3D scanner to leaf angle distribution]
[ leaf_angle_beta, 3D scanner to leaf angle distribution]
[ leaf_angle_chi, 3D scanner to leaf angle distribution]

And for panicles change from: 
panicle_counting 
panicle_volume_median 
panicle_surface_area_median

to:

[panicle_count, 3D scanner to panicle count faster_rcnn + roughness treshold + convex hull]
[panicle_volume, 3D scanner to panicle volume faster_rcnn + roughness treshold + convex hull]
[panicle_surface_area, 3D scanner to panicle surface area faster_rcnn + roughness treshold + convex hull]

Additionally, the leaf length and width parameters would have the following variables and methods:

[leaf_length, 3D scanner to leaf measurements kalman]
[leaf_length, 3D scanner to leaf measurements unfiltered]
[leaf_width, 3D scanner to leaf measurements kalman]
[leaf_width, 3D scanner to leaf measurements unfiltered]

Regarding the leaf angle variance, @ZongyangLi is currently saving the variance, but we could easily compute the standard deviation if that were the preferred measurement.
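For reference, the conversion is a square root: the variance is in squared units (degrees squared, if the angles are in degrees), while the standard deviation is in the same units as the mean. A minimal sketch, with a hypothetical helper name:

```python
import math


def variance_to_sd(variance):
    """Convert a variance to a standard deviation (same units as the mean)."""
    if variance < 0:
        raise ValueError("variance cannot be negative")
    return math.sqrt(variance)
```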

ZongyangLi commented 5 years ago

@dlebauer @abby621

The updated files are in the subdirectory here: https://drive.google.com/open?id=1Y-Qdxe1GgCgXSxR0KFEeyIVyoQyv-tCX

Example leaf angle csv file: https://drive.google.com/open?id=10awD6-suq49L_TGI0x5Q3L-jSJvFmlBX

If we all agree with the current definition of methods and variables, I could add those to BETY.

dlebauer commented 5 years ago

@abby621 you should have access to the google doc if you want to update the records there. then @kimberlyh66 can upload the data and we will be on our way!

abby621 commented 5 years ago

@dlebauer We have the spreadsheet almost entirely filled, but have a question regarding the min/max values. Should that be the min/max that we've ever seen, or some sort of bound on the possible reported values? I'm not sure that we know what that should be -- our algorithms don't specify particular min/max values beyond what's specified by the datatype (so a leaf could technically be hundreds of meters long, even if we would never expect to observe that).

dlebauer commented 5 years ago

@abby621 consider these to be very broad uniform priors that set upper and lower bounds on what data should be considered 'valid'. If they fall outside of the range they will be rejected. Then we can always update the min/max values if they should not be rejected.

So, these should be set so that they provide a high level constraint on valid values - most variables have a lower bound at 0; some have upper bounds at 1 or 100 by definition. The longest leaf in the world is 25m long so we could set max at 25000mm, or we could go with something like 2m which is more reasonable for Sorghum (and wheat). For leaf angle, if in degrees then I think the valid range would be [0,90]? In many cases we have -inf,inf, but these aren't very useful.
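A minimal sketch of how such bounds act as a validity filter. The ranges below are only the examples floated in this comment (0 to 90 degrees for leaf angle, 0 to 2000 mm for leaf length), not values confirmed to be in BETY, and `VALID_RANGES` and `out_of_range` are hypothetical names.

```python
# Example bounds taken from the discussion above; treat them as placeholders.
VALID_RANGES = {
    "leaf_angle": (0.0, 90.0),     # degrees
    "leaf_length": (0.0, 2000.0),  # mm, i.e. the 2 m bound suggested for sorghum
}


def out_of_range(variable, values, ranges=VALID_RANGES):
    """Return the values that fall outside the [min, max] range for a variable."""
    lo, hi = ranges[variable]
    return [v for v in values if not (lo <= v <= hi)]
```

Running this over a day's CSV before uploading would show which rows BETY is likely to reject, and whether the bounds themselves need widening.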

ZongyangLi commented 5 years ago

I have already filled in the sheet and updated the CSV files with the new method names and variables. Can we go ahead and get them uploaded now?
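A minimal sketch of that kind of CSV rewrite, not the script that was actually used. It assumes the method is carried in a column named `method` (an assumption about the upload format) and that each file holds exactly one of the old trait columns; the two renames shown use variable and method names quoted in this thread, and `RENAMES` and `rewrite_csv` are hypothetical.

```python
import csv

# old trait column -> (new variable name, method name)
RENAMES = {
    "98th_quantile_canopy_height": (
        "canopy_height",
        "3D scanner to 98th quantile height",
    ),
    "panicle_counting": (
        "panicle_count",
        "3D scanner to panicle count faster_rcnn + roughness threshold + convex hull",
    ),
}


def rewrite_csv(src, dst, renames=RENAMES):
    """Rename the single known trait column and fill in a 'method' column."""
    with open(src, newline="") as fin:
        reader = csv.DictReader(fin)
        fields = list(reader.fieldnames or [])
        rows = list(reader)

    old = [c for c in fields if c in renames]
    if len(old) != 1:
        raise ValueError("expected exactly one known trait column, found %r" % old)
    new_name, method = renames[old[0]]

    # Keep the column order, swapping in the new variable name.
    out_fields = [new_name if c == old[0] else c for c in fields]
    if "method" not in out_fields:
        out_fields.append("method")

    with open(dst, "w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=out_fields)
        writer.writeheader()
        for row in rows:
            row[new_name] = row.pop(old[0])
            row["method"] = method
            writer.writerow(row)
    return new_name, method
```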

max-zilla commented 5 years ago

OK, I will try to upload in the morning after downloading new CSV files. We must make sure they are in BETY as well. We can ask @kimberlyh66 to add the new / updated names to BETY and I can upload the trait data.

kimberlyh66 commented 5 years ago

Is this the spreadsheet (https://docs.google.com/spreadsheets/d/1nDVti2uj2cWboAmsqzQGyXidZFnqi5jPmBw23nGKH9E/edit#gid=1676929050) with new method names and variables? If @dlebauer approves, I can add to BETY.

kimberlyh66 commented 5 years ago

@max-zilla I can also help with uploading the trait data if you would like.

dlebauer commented 5 years ago

@kimberlyh66 if you can update the method names and descriptions then Max can upload the trait data.


kimberlyh66 commented 5 years ago

All new methods and variables have been added to BETY.

@ZongyangLi You mentioned in the spreadsheet that the method "3D scanner to leaf length and width" should be the same method that I used to upload Zeyu's data. If this is the case, you should use the name "Scanner 3d ply data to leaf length and width".

ZongyangLi commented 5 years ago

@kimberlyh66 The method "3D scanner to leaf length and width" in the spreadsheet was actually created for Zeyu's data. If you have already uploaded his data, you can skip it.

max-zilla commented 5 years ago

Downloaded the rewritten version of all files, but I'm still getting an error on the citation:

/Users/mburnette/Downloads/BETYdbUploadsV2/s4_98th_height_rewrite/2017-05-06_98th_quantile.csv,{:lookup_errors=>[
"No citation could be found matching {\"author\"=>\"ZongyangLi\", \"year\"=>\"2018\", \"title\"=>\"Maricopa Field Station Data and Metadata\"}", 
"No citation could be found matching {\"author\"=>\"ZongyangLi\", \"year\"=>\"2018\", \"title\"=>\"Maricopa Field Station Data and Metadata\"}", 
"No citation could be found matching {\"author\"=>\"ZongyangLi\", \"year\"=>\"2018\", \"title\"=>\"Maricopa Field Station Data and Metadata\"}",
...

I think the other fields match an existing citation, apart from the year (2018). Is this another entry we need to add to BETY first?

ZongyangLi commented 5 years ago

@max-zilla I guess here year should be 2016.

Could you change it from 2018 to 2016 and try again?

If it works I can update all the csv files.

max-zilla commented 5 years ago

@ZongyangLi changing it to 2016 results in Success!
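A mismatch like this can be caught before uploading by listing the distinct citations the CSVs refer to. The sketch below assumes the CSVs carry the citation in columns named `citation_author`, `citation_year`, and `citation_title` (inferred from the author/year/title keys in the error above, not confirmed here), and that the set of citations already in BETY is supplied by the caller.

```python
import csv

# Assumed column names for the citation lookup fields.
CITATION_COLUMNS = ("citation_author", "citation_year", "citation_title")


def citations_in(path):
    """Return the distinct (author, year, title) triples used in one CSV."""
    with open(path, newline="") as f:
        return {
            tuple(row.get(col) or "" for col in CITATION_COLUMNS)
            for row in csv.DictReader(f)
        }


def unknown_citations(paths, known):
    """Return the citations referenced by the CSVs that are not in `known`."""
    found = set()
    for path in paths:
        found |= citations_in(path)
    return sorted(found - set(known))
```

An empty result means every row points at a citation that the caller believes already exists in BETY.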

ZongyangLi commented 5 years ago

@max-zilla Should be all right this time, please find all collections here:

https://drive.google.com/open?id=1fDGakYulkLjLSAG0e_H-MEmjT69Bg2zF

max-zilla commented 5 years ago

Uploading these now, will close this once finished.

max-zilla commented 5 years ago

@ZongyangLi the LeafAngle and 98th height CSVs uploaded successfully, but the panicle CSVs encountered an error:

No method could be found matching {"name"=>"3D scanner to panicle count faster_rcnn + roughness threshold + convex hull"}

max-zilla commented 5 years ago

@kimberlyh66 @ZongyangLi can we update this method in BETY so I can upload the panicle data and close this? Thanks!

ZongyangLi commented 5 years ago

@kimberlyh66

I think there was a typo in the spreadsheet previously: the letter 'h' was missing from 'threshold' in 'roughness threshold'. Could you change it to the correct spelling? Thanks.

kimberlyh66 commented 5 years ago

@ZongyangLi @max-zilla the method has been updated to be 3D scanner to panicle count faster_rcnn + roughness threshold + convex hull
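One-letter mismatches like this are easy to miss by eye. A minimal sketch of a pre-upload check using Python's standard `difflib`, assuming the list of method names registered in BETY has been obtained separately; `suggest_method` and the 0.9 cutoff are illustrative choices.

```python
import difflib


def suggest_method(name, known_methods, cutoff=0.9):
    """Return name if it is registered, else the closest registered name or None."""
    if name in known_methods:
        return name
    matches = difflib.get_close_matches(name, known_methods, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

When the returned name differs from the one in the CSV, either the CSV or the registered method contains a typo and one of them needs correcting before the upload.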

max-zilla commented 5 years ago

@kimberlyh66 thanks much! this is now uploaded & complete.