Open DeirdreLoughnan opened 1 week ago
@kengi-neer posted the following in issue #14, and I have added some quick responses:
Most of the unusual values have been cleaned already. Awesome work, thank you!
Some questions/notes:
- Are the papers for bibby53, cousins10, and morozowska02 available? I haven't seen them in the drive and I didn't see them online through their titles.
Thanks for bringing this to my attention. I am not sure why some of these were scraped, as @ngoj1 is correct that they are not in the source file. I will look into it!
- Data scraping for al-absi10/Al-Absi10 might have been done twice separately. yang18 seems to include all three papers, though all have different species.
Yes, the issue with Al-absi10 has been corrected in the mergeData.R file. But I will have to look into the yang18 issue.
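As an aside on the al-absi10/Al-Absi10 point: duplicate entries that differ only in letter case can be flagged by grouping IDs on a lowercased key. The actual correction lives in mergeData.R, which is not shown in this thread, so the following is only a minimal Python sketch; the function name and the example ID list are hypothetical.

```python
from collections import defaultdict


def find_case_variants(dataset_ids):
    """Group dataset IDs that differ only by letter case (hypothetical helper)."""
    groups = defaultdict(set)
    for ds_id in dataset_ids:
        cleaned = ds_id.strip()
        groups[cleaned.lower()].add(cleaned)
    # Keep only IDs that appear under more than one spelling.
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}


# Illustrative IDs only, taken from names mentioned in the thread.
ids = ["al-absi10", "Al-Absi10", "yang18", "yang08", "Al-absi10"]
variants = find_case_variants(ids)
print(variants)
```

Any ID returned here would then need a manual check against the PDFs to decide whether the rows are true duplicates or separate papers.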
- Some points might be missing from pipinis12 and tylkowski91 since all others in the same table have been included.
If you think data was not entered, could you please add it? This might have been a simple mistake.
- Some response variable values are averaged over different treatments or interactions deemed not significant (chill temperature, duration, and even chemical), so they were not differentiated. I was thinking of just removing data for chill temperature and duration for those since they aren't helpful anyway. Examples in wytsalucy21 (chill temp - 4, 7, 10), Nin17 (chill dur - 15;30;60;90), okay11 (chill dur - 90, 120, 150), and tang21 (chill dur - 0, 15, 30, 60).
We don't want to remove any data at this stage of cleaning. It is ok to leave the values as they are for now, but please double check that there really is no way to differentiate the treatment levels in the figures. If we could fix this, then we should.
- Quite a few values entered as chill temperature and duration seem to actually be germination values, and vice versa, and possibly storage values too (will check yang18). Some examples are the corrected unusual values in yang08 (30/20) or even the corrected normal-looking values in tylkowski91 (3).
- na11 has some contradicting descriptions of their treatments. Values stated in methods are different from those stated in results.
Interesting, let me think on these two.
- Response variable and values for Schutz02 table 3 and pritchard93 table 3 seem to be mismatched.
Is this something that can be fixed by going back to the pdf, or do you think it is a typo in the publication?
- Chill durations are rounded to the nearest integer.
The unit is in days, so this makes sense to me and seems fine.
- Treatments without chilling in papers with chilling treatments were given a chill duration of 0 and chill temperature of NA. Or should I still include a chill temperature for treatments specified with a 0-week duration for grouping purposes?
Are these to indicate control treatments? I think for a control this sounds reasonable.
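The convention discussed here (a no-chill control gets a chill duration of 0 and a chill temperature of NA) can be sketched as a small rule. This is only an illustration in Python, using `None` to stand in for NA; the function name and argument names are assumptions, not columns or functions from the project's scripts.

```python
def code_chilling(chill_temp_c, chill_duration_days):
    """Apply the control convention from the thread (hypothetical helper).

    Returns a (temperature, duration) pair, with None standing in for NA.
    """
    if chill_duration_days == 0:
        # Control: no chilling was applied, so a temperature is not meaningful.
        return (None, 0)
    # Otherwise keep what the paper reports, including a missing duration.
    return (chill_temp_c, chill_duration_days)


print(code_chilling(4, 0))      # control treatment
print(code_chilling(4, 30))     # chilled treatment
print(code_chilling(4, None))   # temperature reported without a duration
```

The third case mirrors the jacquemart21 and Naseri18 situation raised further down, where a temperature is given but the duration stays NA.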
- Data points with chill temperature without duration in jacquemart21 and Naseri18.
I assume you have double-checked the PDFs; if so, having the duration be NA makes sense to me.
I still have to check data with normal-looking chill temperature and duration values due to cases similar to 5 (yang08 and yang18 come to mind), but there are 2335, 875, and 1539 data points with temperature uncertainty, alternating temperature cycle, and daily light values. I put notes in the script, and separated and tagged changes by paper if needed.
Awesome thanks!
@lizzieinvancouver, I spoke with @kengi-neer today; he has found several data points in papers that were not entered (I have found a few too).
I know for OSPREE when we entered new raw data, we created new files and we tend not to edit the original excel files. Do you have a preference for what we do here? Should we just add new rows of data to the existing files (which were created by people no longer in the lab and might be simplest) or create a new file for "new" data entry?
@lizzieinvancouver, some additional details on what @DeirdreLoughnan and I discussed earlier on some issues.
> Thanks for bringing this to my attention. I am not sure why some of these were scraped, as @ngoj1 is correct that they are not in the source file. I will look into it!

I found bibby53 listed in the source Excel sheet from DM, which we think was supposed to have been removed some time ago.
> We don't want to remove any data at this stage of cleaning. It is ok to leave the values as they are for now, but please double check that there really is no way to differentiate the treatment levels in the figures. If we could fix this, then we should.
Will not change for now, but maybe consider removing details on treatments since they are averaged across treatments such as different chill temperature or durations.
> Is this something that can be fixed by going back to the pdf, or do you think it is a typo in the publication?

I think they can be fixed by going back to the PDF. For Schutz02, the response values are actually the number of days until more than 1% of the seed had germinated. These could be moved to germ duration, with the response values as 1 instead, as I think was done for another paper for 50% germination (not sure, will try to find it while checking them again). For pritchard93, the response values are actually the mean length, while the per germ columns are beside it: at the left for embryos isolated from seed and at the right for seeds.
> Are these to indicate control treatments? I think for a control this sounds reasonable.

Yes, I just have to think about controls for experiments where the treatments are more of a list (e.g. soaking vs scarification vs warm strat vs cold strat) rather than a factorial arrangement (e.g. chemical x cold strat duration x photoperiod), which may be better off without a chill temperature at all.
> I assume you have double-checked the PDFs; if so, having the duration be NA makes sense to me.

For Naseri18, some seeds were simply left for more than 16 weeks without any specific duration. For jacquemart21, they simply mentioned cold stratification without mentioning any specific treatments (they have many combinations of stratification and chemical scarification treatments).
@kengi-neer Thanks for the update! These seem like good decisions and it's great to have them clearly documented in the git issue here.
@lizzieinvancouver I would just like to follow up on how to add missing data points as @DeirdreLoughnan brought up last week.
Also, some notes/questions on cleaning:
> I know for OSPREE when we entered new raw data, we created new files and we tend not to edit the original excel files. Do you have a preference for what we do here? Should we just add new rows of data to the existing files (which were created by people no longer in the lab and might be simplest) or create a new file for "new" data entry?
@DeirdreLoughnan @kengi-neer I think creating a single file for new data entry (that various people can add to) sounds best for this.
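One way the single new-data file could work in practice is to leave the original files untouched and combine them with the new-entry file at read time, tagging each row with where it came from. The sketch below is only an illustration in stdlib Python: the column names, the `entry_source` tag, and every value in the example rows are made-up placeholders, not real data or the project's actual schema.

```python
import csv
import io


def combine_with_new_entries(original_csv, new_csv):
    """Read both tables and tag each row with its origin (hypothetical helper)."""
    rows = []
    for source, text in (("original", original_csv), ("new_entry", new_csv)):
        for row in csv.DictReader(io.StringIO(text)):
            row["entry_source"] = source  # provenance tag, an assumed column
            rows.append(row)
    return rows


# Placeholder tables; the numbers are invented for illustration only.
original = "datasetID,chill_duration,response\npipinis12,30,45\n"
new = "datasetID,chill_duration,response\npipinis12,60,72\ntylkowski91,90,10\n"

combined = combine_with_new_entries(original, new)
for row in combined:
    print(row["datasetID"], row["chill_duration"], row["entry_source"])
```

Keeping the provenance tag makes it easy to separate later additions from the originally scraped rows if a question comes up during cleaning.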
for warm stratification of 20°C (16h dark) and 25°C (8h light) for 1 month followed by cold stratification of 2-4°C (dark) for 1 month.
Just a few notes:
Thank you @DeirdreLoughnan for the access.
Originally posted by @kengi-neer in https://github.com/lizzieinvancouver/egret/issues/14#issuecomment-2167094617