Closed lizzieinvancouver closed 4 years ago
@dbuona @FaithJones Can you both think about how best to fix this one? Given the issue it might be a case where fixing it in the original file is allowed (maybe). @DarwinSodhi Since this was your paper, can you work on extracting the critical data in Figure 1 (bottom panel, we need the DATES ... which I think are each represented twice so I would take the average)?
see also #355
@lizzieinvancouver Where should I put the extracted date data?
@DarwinSodhi Thanks! I would put it in update2019 and maybe give it a name like ospree_2019update_DSSaddmaly18 Also tagging @dbuona @FaithJones here as I am sure they need this.
@lizzieinvancouver Do you want me to attach the field sample dates to the original data I scraped? This will probably take some more time because there aren't 15 data points for each species in figure 2 but there are 15 field sampling dates. I would need to re-scrape all the data and compare it against my newly scraped sampling dates to see which specific field dates are missing.
@DarwinSodhi I am not sure, @dbuona and @FaithJones are supposed to head up how best to correct the data so I leave it to them.
@lizzieinvancouver Okay, last question, which may actually render this whole date scraping business null and void: in figure 1 (attached here) it is not clear what the final date on the x-axis should be. How should I proceed? To me the last date could be April 31st 2014 but this could be a very dangerous assumption to make.
@DarwinSodhi You should use ImageJ to extract the dates! You probably need to calculate DOY for each of the tick dates, then extract DOY, then @dbuona and @FaithJones (and me if it helps) can discuss how to round to the correct date. But -- assuming the paper gives no dates (is that true?) -- we should not be guessing or assuming, just extracting the dates as we extract all other data.
@lizzieinvancouver Yes! I followed the instructions from the Entering new Data 2019 and Beyond wiki. The issue is that throughout the paper they tell you the first date they did field sampling, which was "December 13th 2013", but not the last date (this is important because you need this information in order to calibrate the figure, or in this case set the scale for the x-axis, in order to calculate the DOY). The methods in the paper only state "Starting on 13 December, every 7-10 days, 2 seedlings from each provenance and species were transferred from the sand bed to climate chambers (Fig. 1)". Without either the last field sampling date or information about where the x-axis date ends for figure 1, it is not possible to get the field sample dates for all the data.
@DarwinSodhi You should be able to calibrate the distance between the tick marks to make it work. I will leave it to you and @dbuona and @FaithJones to work out together as I think this is a tractable issue.
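A minimal sketch of the tick-mark calibration idea, for anyone following along. It assumes two labelled ticks whose dates are known (13 December and 1 April are the calibration dates mentioned in this thread); the pixel positions and the extracted points are made-up numbers, and `px_to_date` is a hypothetical helper, not part of the ospree code.

```r
# Hypothetical pixel positions of two labelled x-axis ticks (made-up values)
tick_px   <- c(100, 536)
tick_date <- as.Date(c("2013-12-13", "2014-04-01"))

# Days per pixel, from the spacing between the two known ticks
days_per_px <- as.numeric(diff(tick_date)) / diff(tick_px)

# Convert an extracted pixel position to a calendar date (rounded to whole days)
px_to_date <- function(px) {
  tick_date[1] + round((px - tick_px[1]) * days_per_px)
}

point_px   <- c(100, 300, 536)              # made-up extracted points
point_date <- px_to_date(point_px)
point_doy  <- as.integer(format(point_date, "%j"))  # day of year
```

How to round to whole days is still a judgment call to agree on; `round()` here is only one option.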
@lizzieinvancouver @FaithJones @dbuona I added the field sampling date for malyshev18. You can find it under ospree/data/update2019/. It is named ospree_2019update_DSSaddmaly18. Let me know if there is anything else I can do.
@DarwinSodhi thanks for adding field sampling data. I have taken another look at the manuscript, and I think we also need a bit more data scraping, specifically chilling hours in Figure 1. As @lizzieinvancouver mentioned above, you should be ok scraping the data. Just calibrate from the 13th of December to the 1st of April, then the program will extrapolate from there. Once we have the days since 1st of December data for each point and the chilling data, we should be able to combine with data from Figure 2 top panels.
@FaithJones You are referring to the chilling temperature sum data in figure 1, correct?
@DarwinSodhi yep, that's right
@FaithJones Done!
@DarwinSodhi @FaithJones Hi Yall, the data that was originally scraped was from figure 2. I don't think figure 1 gives us enough information about the species or experimental treatments which is ultimately what we want.
I think figure 2 is the information we need. The top panel has "days since Dec 1" on the x axis and days to bb at 20 degrees on the y axis, with 2 different photoperiods. I think this x axis should be the field sampling date for each species. My sense is the lower panel (GDD ~ chill units) is the same data as the first panel but using temperature units. @lizzieinvancouver, correct me if I am wrong, but we should probably scrape both and the GDDs will get dropped in the cleaning process.
Let me know if this makes sense, or if you think I am reading the figures wrong
@FaithJones @dbuona I have scraped the data but it is still in days since December 10th, do either of you know a simpler way of converting the data to actual dates? I can also do it by hand but thought I would ask first!
Update: I figured it out and put the field sample dates in ospree/data/update2019/ospree_2019update_DSSaddmaly18.xlsx
@DarwinSodhi Darwin, did you scrape a response variable for malyshev18? I think, if I am not mistaken, the newly scraped data needs to have the same format as the data_detailed tabs in order for it to integrate nicely with the rest of the database (see the prevey datasheets addinpreveydata.R for an example of what the final product should look like). @FaithJones does this seem right?
@dbuona The response variable was already scraped in the original file (ospree_2019update_DSS). I can re-add the y-variable to my current update as long as we don't end up with duplicates! Just let me know!
@DarwinSodhi I think the end goal of the task is to get the data in a format so that they can be added to the main database, which sounds like it will involve some sort of merger between what you originally scraped and the new field sample dates. Probably best to do in R so we can trace it. Do you want to give it a try and I can check any code you generate?
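A rough sketch of what such a merge could look like in R, with a guard against the duplicates mentioned above. The two toy data frames, the `point_id` key, and the species labels are all hypothetical stand-ins; the real files will have different columns.

```r
# Toy stand-ins for the two scraped tables (all values are made up)
orig <- data.frame(
  datasetID     = "malyshev18",
  species       = c("sp_a", "sp_a", "sp_b"),
  point_id      = c(1, 2, 1),          # hypothetical key for each scraped point
  response.time = c(30.1, 22.4, 41.0)
)
dates <- data.frame(
  species   = c("sp_a", "sp_a", "sp_b"),
  point_id  = c(1, 2, 1),
  dayssince = c(12, 20, 12)
)

# Left join: keep every originally scraped row exactly once
merged <- merge(orig, dates, by = c("species", "point_id"), all.x = TRUE)

# Guard against duplicated or unmatched rows introduced by the join
stopifnot(nrow(merged) == nrow(orig))
stopifnot(!any(is.na(merged$dayssince)))
```

Doing the join in a script rather than by hand keeps the step traceable, which is the point made above.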
@DarwinSodhi @dbuona, I have taken a look, and am now a little confused. I thought we needed the chilling data, but I don't see it in the scraped data. Don't we need the data from the forcing units/chilling units panels too?
@FaithJones The chilling will be calculated in the ospree code. We needed accurate sample dates to be able to do that. Figure 1 does report chilling units at the field site up to each sample day, but the data we actually need for the analysis come from figure 2, which has the budburst data.
We could "combine" the two figures and report the chill units they report for each field sample day, but to me it seems more transparent and consistent with the methodology of ospree to calculate it ourselves. @lizzieinvancouver Do you have any guidance on this?
@dbuona Yeah I can merge the two datasets in R and let you know when the code is done! I will wait to see what Lizzie says just in case I need to scrape more data!
@dbuona @FaithJones This is hard because they seem to have given us chill and field sample data (which is not very common), and -- to be standard across studies -- we used gridded climate data to convert field sample date to field chill units. I think we just need the data in Figure 2, with field sample date calculated from the x axis and response.time from the y axis. @DarwinSodhi it looks like you did not scrape the x axis originally from Figure 2? Is that correct or is it somewhere else?
Note that changing it to a calendar date should be done in R. I think merging it should be done by @FaithJones or @dbuona with the other one checking it.
We don't need forcing units scraped if we know the forcing treatment information.
To clarify, we can skip scraping the chill units data then. But we should note in the notes column that it exists and which CU model it uses.
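For the calendar-date conversion, a minimal sketch in R. It assumes the x axis counts days since 1 December 2013 with 1 December as day 0; the scraped values and the name `fieldsample_date` are made up for illustration.

```r
# Made-up scraped x-axis values: days since 1 December 2013
dayssince <- c(12, 45.3, 120.8)

# Scraped values are rarely exact integers, so round to whole days first
origin <- as.Date("2013-12-01")
fieldsample_date <- origin + round(dayssince)

format(fieldsample_date)
```

Because `as.Date` arithmetic follows the real calendar, this avoids converting back and forth in a spreadsheet.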
@lizzieinvancouver I scraped the x-axis on Figure 2 6 days ago, but I converted the data to calendar dates in excel, should I revert the data back to days since December 1st and re-add the data to ospree_2019update_DSS?
@DarwinSodhi Did you post the data unconverted in any way in Excel previously? (That's what we want ... converting back and forth in Excel is very bad as it does funny things, usually due to leap years.) I can revert back to that if you did post it at some point.
@lizzieinvancouver no but I have the data saved in a file on my computer!
@DarwinSodhi That's good! Please post that and I will take a look.
@lizzieinvancouver Done, the data should be in ospree/data/update2019/ospree_2019update_DSSaddmaly18.xlsx
@DarwinSodhi Thanks, I will see where I can get then pass it off to @dbuona and @FaithJones to check.
@FaithJones @dbuona I worked on some code in analyses/cleaning/merge_update2019.R but it needs some extra checking (based on one plot I am concerned it is not correct). Whoever has time first tomorrow -- @FaithJones @dbuona -- please work on it and report back!
@cchambe12 If you don't hear any updates on this by 3pm EDT, please just start re-running the cleaning code so we can stay on track.
Thank you all!
@dbuona @FaithJones One more cleaning issue for this paper -- sample size should be checked and corrected if needed!
@dbuona @lizzieinvancouver @DarwinSodhi I am taking a look now, will try and provide an update soon.
@FaithJones Great! I'm going to do common garden observations so I won't rerun the climate data until later anyways.
@dbuona @lizzieinvancouver @DarwinSodhi I checked the merging code, and to me it all looks good. @lizzieinvancouver I don't understand where you see a problem with Figure 2. When I looked, the figure from the script matched the figure from the article closely. The species were in different orders, and the bottom panel with the chilling hours was missing, but I think it is supposed to be missing? The Figure in the paper is quite confusing in that it should be a 2 by n number of species plot but it wraps around once so it fits on the page. If this is not the problem, then I might need a nudge in the right direction.
@dbuona @lizzieinvancouver @DarwinSodhi Re sample size, I checked how many data points I could see, and I saw 181. This is the same number of data points Darwin scraped. I couldn't find a clear answer from the paper as to how many data points there should be.
@FaithJones Thanks for working on this! Quick reply ... First, 181 is not the sample size we want, we want the actual sample size that led to each data point; it's usually not on the graph, but may be mentioned in the methods. Second, I have put in two plots -- the first one uses the updated data and looks like this:
This does not look like the paper figure to me, so next I plotted the data versus the dayssince:
And I agree this looks good, but it does not use the data just created ....
PS: Please add your working directory as an if else, otherwise it works only for you, and no one else.
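One way the if/else could be written, as a sketch only. The user names and paths below are placeholders, and `pick_wd` is a hypothetical helper rather than anything already in the repo.

```r
# Pick a project directory based on who is running the script.
# The names and paths below are placeholders, not real locations.
pick_wd <- function(current = getwd()) {
  if (grepl("lizzie", current, ignore.case = TRUE)) {
    "~/Documents/git/ospree/analyses"       # placeholder path
  } else if (grepl("faith", current, ignore.case = TRUE)) {
    "~/Documents/github/ospree/analyses"    # placeholder path
  } else {
    current                                 # fall back to the current directory
  }
}

# setwd(pick_wd())  # uncomment in a real script
```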
@dbuona @lizzieinvancouver @DarwinSodhi OK, I dived back in, and this is my current take on the situation. I still think the data generally looks accurate. @lizzieinvancouver, when I converted the dates into Dates rather than characters, it looked the same as the dayssince plot. So this might be the cause of the problem? I updated the cleaning scripts (including the setwd, sorry).
As for numbers of data points, I actually think we are too low but it is difficult to say for sure. The manuscript is vague. They said that "starting on 13 December, every 7-10 days, 2 seedlings from each provenance and species were transferred from the sand bed to climate chambers (Fig. 1)". Figure 1 suggests 15 different dates that seedlings were brought in. But I think some species burst bud earlier, so they have fewer sampling points. I think we are missing a few points where the 16 hour photoperiod data point is hiding behind the 8 hour photoperiod, but I am not sure.
There is also one data point for Acer where the response time is far too high (>60). I think this is a typo?
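On the Dates-versus-characters point, a small sketch of why a plot can look wrong when dates are stored as text. The three dates are made up, and the `%b` parsing assumes an English locale for month abbreviations.

```r
# Made-up dates in the year-Mon-day style used in this dataset
chr <- c("2014-Mar-31", "2014-Jan-05", "2014-Feb-10")

# As characters, ordering is alphabetical: Feb, Jan, Mar
sort(chr)

# As Dates, ordering is chronological: Jan, Feb, Mar
# (%b depends on the locale; English month abbreviations are assumed)
dts <- as.Date(chr, format = "%Y-%b-%d")
sort(dts)
```

Any plotting function that treats the x values as categories would place the points in the alphabetical order, which may be what happened in the first plot.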
@FaithJones Thanks! Yes, I thought it was likely it was just a plotting issue, but I did not have time to check and fix it. I will check on the sample size and >60 value and get back to you when I am next online. @cchambe12 please go ahead on cleaning/merging the full data and thank you!
@FaithJones I think you are right! They have just n=1 for each data point ("We therefore used one tree seedling per sampling date per photoperiod treatment (i.e. we allocated the 15 tree seedlings per species across the maximum number of sampling dates)"). A bit weird, but I think correct given that they go on to defend this at length in the paper. I also agree that Acer value >60 looks very wrong, @DarwinSodhi, can you check what is going on?
I forgot to push the malyshev18 csv and just did, I will update the merge_update2019.R now.
@lizzieinvancouver I incorrectly added that value, it should read 5.574 instead of 68.837. How should I change the value?
Thanks! @DarwinSodhi I think you should clean it in analyses/cleaning/clean_misc.R something like:
# Fix malyshev18 incorrect datapoint by Darwin in April 2020
d$response.time[which(d$datasetID=="malyshev18" & d$genus=="Acer" & d$species=="psuedowhatever" & d$fieldsample.date=="which one" & d$photoperiod_day=="")] <- 5.574 # incorrect original entry of data (was 68.837, way too high).
It should go towards the end, after Nacho's fix on anzanello16. @FaithJones Can you check it once done? And make a note in merge_update2019.R that this is fixed in clean_misc.R?
Thank you both!
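A sketch of the same kind of fix with a guard added, so the assignment only runs when exactly one row matches. The toy data frame is made up apart from the two response times quoted in this thread; the real filter would also use species and field sample date.

```r
# Toy data frame standing in for the cleaned data (made-up rows, except
# for the two response.time values quoted in the thread)
d <- data.frame(
  datasetID       = c("malyshev18", "malyshev18", "other19"),
  genus           = c("Acer", "Acer", "Acer"),
  photoperiod_day = c("16", "8", "16"),
  response.time   = c(68.837, 12.3, 68.837)
)

# Locate the bad entry and confirm exactly one row matches before editing
bad <- which(d$datasetID == "malyshev18" & d$genus == "Acer" &
             d$photoperiod_day == "16" & d$response.time > 60)
stopifnot(length(bad) == 1)

d$response.time[bad] <- 5.574  # was 68.837
```

The `stopifnot` check would catch a misplaced parenthesis or a filter that matches zero or several rows before any value is overwritten.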
@lizzieinvancouver @FaithJones Done! The code is after #fixmalyshev in analyses/cleaning/clean_misc.R
@lizzieinvancouver @DarwinSodhi I took a look, but I am not exactly sure how it should work. There is indeed clean and annotated code in clean_misc.R that fixes the problem. It did not fix the problem in the file where we were checking the data, but I assume that is because it will run on the full dataset once it is added to the spreadsheet? Anyway, I added the merge_update2019.r code, and the data looks good after that. I added a note in this code explaining that the problem was fixed in analyses/cleaning/clean_misc.R.
@FaithJones Ah sorry to add to the confusion! When I was fixing man17 in clean_misc.R I noticed an aberrant parenthesis in the malyshev code so I moved it and reran it. Glad it is working now!
@DarwinSodhi @dbuona @FaithJones There are about a dozen fieldsample.dates that are year-month-year. Do we know what these values should be?
Thank you @DeirdreLoughnan! I worry I may have added this, so I will check today and report back to everyone (@DarwinSodhi @dbuona @FaithJones)
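A quick way to list the odd values, as a sketch. It assumes the expected shape is a four-digit year, a three-letter month, and a one- or two-digit day; the example values are made up.

```r
# Made-up examples: two well-formed values and one year-month-year value
fieldsample.date <- c("2014-Mar-31", "2013-Dec-13", "2014-Mar-2014")

# Expected shape (an assumption): YYYY-Mon-D or YYYY-Mon-DD
ok <- grepl("^[0-9]{4}-[A-Za-z]{3}-[0-9]{1,2}$", fieldsample.date)

# Values that need a manual look
fieldsample.date[!ok]
```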
@FaithJones I ran cleanmerge_all.R through this line:
d <- rbind(dorg, dup)
and then almost all of clean_misc.R until the new line, and I checked how that was working (yes!). And then for good measure, I cleaned out R (restarted it) and then I ran cleanmerge_all.R through step 10 and checked for what this value is:
d$response.time[which(d$datasetID=="malyshev18" & d$genus=="Acer" & d$species=="pseudolatauns" & d$fieldsample.date=="2014-Mar-31" & d$photoperiod_day=="16")]
It's 5.574 so I think we're good here. Thanks @DarwinSodhi and @FaithJones !
Fig 1 suggest to me there is a chilling treatment ...