lizzieinvancouver / egret


Cleaning chill temp and duration #23

Open DeirdreLoughnan opened 1 week ago

DeirdreLoughnan commented 1 week ago
@lizzieinvancouver I've temporarily added three columns: uncertainty (degC), temperature cycles for alternating temperatures (h), and daily light (h per day). I'll tally them afterwards to check whether they're worth including. A sample entry would be

Screenshot 2024-06-13 185946

for warm stratification of 20°C (16h dark) and 25°C (8h light) for 1 month followed by cold stratification of 2-4°C (dark) for 1 month.

Just a few notes:

Thank you @DeirdreLoughnan for the access.

Originally posted by @kengi-neer in https://github.com/lizzieinvancouver/egret/issues/14#issuecomment-2167094617

DeirdreLoughnan commented 1 week ago

@kengi-neer posted the following in issue #14, and I have added some quick responses:

Most of the unusual values have been cleaned already. Awesome work, thank you!

Some questions/notes:

  1. Are the papers for bibby53, cousins10, and morozowska02 available? I haven't seen them in the drive and I didn't see them online through their titles.

Thanks for bringing this to my attention. I am not sure why some of these were scraped, as @ngoj1 is correct that they are not in the source file. I will look into it!

  2. Data scraping for al-absi10/Al-Absi10 might have been done twice separately. yang18 seems to include all three papers, though all have different species.

Yes, the issue with Al-absi10 has been corrected in the mergeData.R file. But I will have to look into the yang18 issue.

  3. Some points might be missing from pipinis12 and tylkowski91, since all others in the same table have been included.

If you think data was not entered, could you please add it? This might have been a simple mistake.

  4. Some response variable values are averaged over different treatments or interactions deemed not significant (chill temperature, duration, and even chemical), so the treatment levels were not differentiated. I was thinking of just removing the chill temperature and duration data for those, since they aren't helpful anyway. Examples are in wytsalucy21 (chill temp - 4, 7, 10), Nin17 (chill dur - 15;30;60;90), okay11 (chill dur - 90, 120, 150), and tang21 (chill dur - 0, 15, 30, 60).

We don't want to remove any data at this stage of cleaning. It is ok to leave the values as they are for now, but please double check that there really is no way to differentiate the treatment levels in the figures. If we could fix this, then we should.

  5. Quite a few values for chill temperature and duration seem to be for germination, and vice versa, and possibly for storage too (will check yang18). Some examples are the corrected unusual values in yang08 (30/20) or even the corrected normal-looking values in tylkowski91 (3).
  6. na11 has some contradictory descriptions of its treatments. Values stated in the methods are different from those stated in the results.

Interesting, let me think on these two.

  7. Response variable and values for Schutz02 table 3 and pritchard93 table 3 seem to be mismatched.

Is this something that can be fixed by going back to the pdf, or do you think it is a typo in the publication?

  8. Chill durations are rounded to the nearest integer.

The unit is in days, so this makes sense to me and seems fine.

  9. Treatments without chilling in papers with chilling treatments were given a chill duration of 0 and chill temperature of NA. Or should I still include a chill temperature for treatments specified with a 0-week duration for grouping purposes?

Are these to indicate control treatments? I think for a control this sounds reasonable.

  10. There are data points with a chill temperature but no duration in jacquemart21 and Naseri18.

I assume you have double checked the PDFs; if so, having the duration be NA makes sense to me.
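The conventions discussed above for chill duration and untreated controls could be sketched as follows. This is only an illustration in Python (the project's own scripts, such as mergeData.R, are in R), and the function names are hypothetical. It assumes durations are stored in days, that halves round up, and that Python's `None` stands in for NA. The thread does not say which rounding rule was actually used, so the half-up choice is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional, Tuple

DAYS_PER_WEEK = 7  # exact; months are deliberately not converted here


def round_half_up(value: float) -> int:
    """Round to the nearest integer, with .5 going up (assumed rule)."""
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def weeks_to_days(weeks: float) -> int:
    """Convert a duration reported in weeks to whole days."""
    return round_half_up(weeks * DAYS_PER_WEEK)


def normalize_chill(
    temp_c: Optional[float], duration_days: Optional[float]
) -> Tuple[Optional[float], Optional[int]]:
    """Apply the conventions from the thread to one (temperature, duration) pair."""
    if duration_days is None:
        # Temperature reported without a duration: duration stays NA.
        return temp_c, None
    if duration_days == 0:
        # No-chill control: duration 0, temperature NA.
        return None, 0
    return temp_c, round_half_up(duration_days)
```

The explicit half-up helper is used because Python's built-in `round` (like R's) rounds halves to the nearest even number, so 10.5 days would become 10 rather than 11.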

I still have to check data with normal-looking chill temperature and duration values because of cases similar to item 5 (yang08 and yang18 come to mind), but there are 2335, 875, and 1539 data points with temperature uncertainty, alternating temperature cycle, and daily light values, respectively. I put notes in the script, and separated and tagged changes by paper where needed.

Awesome thanks!
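A tally like the one reported above (how many data points have a value in each of the three temporary columns) could be computed along these lines. This is a minimal Python sketch, not the project's R code. The column names, the sample rows, and the set of strings treated as missing are all assumptions made for illustration.

```python
import csv
import io

MISSING = {"", "NA"}  # assumed encodings of a missing value


def tally_filled(rows, columns):
    """Count, per column, the rows that hold a non-missing value."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            if (row.get(col) or "").strip() not in MISSING:
                counts[col] += 1
    return counts


# Hypothetical column names and made-up rows, for illustration only.
COLUMNS = ["chillTempUnc", "chillTempCycle", "chillLightCycle"]
sample = io.StringIO(
    "datasetID,chillTempUnc,chillTempCycle,chillLightCycle\n"
    "paperA,1,NA,0\n"
    "paperA,NA,16,8\n"
    "paperB,NA,NA,NA\n"
)
print(tally_filled(csv.DictReader(sample), COLUMNS))
```

Note that a daily light value of 0 (complete darkness) counts as filled here; only empty cells and NA are treated as missing.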

DeirdreLoughnan commented 1 week ago

@lizzieinvancouver, I spoke with @kengi-neer today; he has found several data points in papers that were not entered (I have found a few too).

I know for OSPREE when we entered new raw data, we created new files, and we tend not to edit the original Excel files. Do you have a preference for what we do here? Should we just add new rows of data to the existing files (which were created by people no longer in the lab), which might be simplest, or create a new file for "new" data entry?

kengi-neer commented 1 week ago

@lizzieinvancouver, some additional details on what @DeirdreLoughnan and I discussed earlier on some issues.

Thanks for bringing this to my attention. I am not sure why some of these were scraped, as @ngoj1 is correct that they are not in the source file. I will look into it!

I found bibby53 listed in the source Excel sheet from DM; we think it was supposed to have been removed some time ago.

We don't want to remove any data at this stage of cleaning. It is ok to leave the values as they are for now, but please double check that there really is no way to differentiate the treatment levels in the figures. If we could fix this, then we should.

Will not change for now, but maybe consider removing the treatment details for these values, since they are averaged across treatments such as different chill temperatures or durations.

Is this something that can be fixed by going back to the pdf, or do you think it is a typo in the publication?

I think they can be fixed by going back to the PDF. For Schutz02, the response values are actually the number of days until more than 1% of the seeds had germinated. These could be moved to germ duration, with the response value set to 1 instead, as I think was done for another paper reporting 50% germination (not sure; I will try to find it while checking them again). For pritchard93, the response values are actually the mean length, while the per germ columns are beside it: at the left for embryos isolated from seed and at the right for seeds.

Are these to indicate control treatments? I think for a control this sounds reasonable.

Yes, I just have to think about controls for experiments in which the treatments are more of a list (e.g. soaking vs scarification vs warm strat vs cold strat) rather than a factorial arrangement (e.g. chemical x cold strat duration x photoperiod); such controls may be better off without a chill temperature at all.

I assume you have double checked the PDFs; if so, having the duration be NA makes sense to me.

For Naseri18, some seeds were simply left for more than 16 weeks without any specific duration. For jacquemart21, they simply mentioned cold stratification without specifying any particular treatments (they have many combinations of stratification and chemical scarification treatments).

lizzieinvancouver commented 1 week ago

@kengi-neer Thanks for the update! These seem like good decisions and it's great to have them clearly documented in the git issue here.

kengi-neer commented 4 days ago

@lizzieinvancouver I would just like to follow up on how to add missing data points as @DeirdreLoughnan brought up last week.

Also, some notes/questions on cleaning:

  1. For data averaged across treatments, usually because different parameters had no significant interactions, I temporarily changed their values to "ave". Please let me know if this is better than just putting them as NA.
  2. Treatments in yang08 Table 3 were described in different forms, including "moist storage", "wet storage", and "stratification", with storage periods reaching up to 24 months but stated to follow a procedure similar to that used for stratification. Should this be considered storage data or chill data?
  3. yang16 and yang20 have two papers each as indicated in the data source file, but there are only data points for one paper in the data after merging (which is in the EGRET drive as yang16_2 and yang20). There could be missing data from yang16_1 Table 5 and yang20 Table 1, though yang20_1 is in another language except for the abstract.
  4. The three yang18 papers all have storage data in chill columns.
  5. yang18_2 seems to have mismatched figure/table sources; I'll clean it later, after that has been fixed, since it's easier to index by figure/table source.
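
One way downstream code could handle the temporary "ave" placeholder from the first note above is to read it as a missing value while keeping a flag that records the averaging, so the information is not lost the way it would be with a plain NA. This is a hedged Python sketch rather than the project's R code; the function name and the returned flag are hypothetical.

```python
from typing import Optional, Tuple


def parse_chill_value(raw: str) -> Tuple[Optional[float], bool]:
    """Return (numeric value or None, averaged-across-treatments flag)."""
    text = raw.strip()
    if text.lower() == "ave":
        # Placeholder for values averaged across treatment levels.
        return None, True
    if text in ("", "NA"):
        # Ordinary missing value: nothing was averaged.
        return None, False
    # Any other non-numeric entry raises ValueError on purpose,
    # so that it surfaces for manual checking.
    return float(text), False
```

For example, `parse_chill_value("ave")` gives `(None, True)`, whereas `parse_chill_value("NA")` gives `(None, False)`, so the two cases stay distinguishable after cleaning.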

lizzieinvancouver commented 3 days ago

I know for OSPREE when we entered new raw data, we created new files, and we tend not to edit the original Excel files. Do you have a preference for what we do here? Should we just add new rows of data to the existing files (which were created by people no longer in the lab), which might be simplest, or create a new file for "new" data entry?

@DeirdreLoughnan @kengi-neer I think creating a single file for new data entry (that various people can add to) sounds best for this.
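The single new-data-entry file suggested above could be combined with the original data at merge time roughly like this. It is a minimal Python sketch for illustration only (the actual merge lives in the R script mergeData.R). The function name and the `entrySource` column are hypothetical, and it assumes the new-entry file uses the same column names as the original files.

```python
from typing import Dict, List

Row = Dict[str, str]


def combine_with_new_entries(existing: List[Row], new_entries: List[Row]) -> List[Row]:
    """Append rows from the shared new-entry file, leaving the originals unedited."""
    if existing and new_entries:
        expected = set(existing[0])
        for i, row in enumerate(new_entries):
            if set(row) != expected:
                raise ValueError(
                    f"new-entry row {i} has columns {sorted(row)}; "
                    f"expected {sorted(expected)}"
                )
    # Tag each row with where it came from (hypothetical extra column).
    combined = [dict(row, entrySource="original") for row in existing]
    combined += [dict(row, entrySource="new_entry") for row in new_entries]
    return combined
```

Tagging each row with its source keeps later additions traceable, and the original Excel files never need to be modified.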