lizzieinvancouver / egret

Removing species - crops, ornamental, monocots #5

Closed DeirdreLoughnan closed 2 years ago

DeirdreLoughnan commented 2 years ago

I have been working on removing species to get a more manageable list of papers. For now we are removing crops and ornamental species, and monocots (although, we might add them back in later).

Dan categorized species as crops and I have been using the taxize R package to identify monocots. The R code can be found here.

This has not worked as well as I had hoped. Correct me if I am wrong, but the monocot vs dicot classification is a taxonomic division? This should be doable with the taxize package, but NCBI just produced a column of NAs and ITIS returned Tracheophyta for everything.

@lizzieinvancouver @dbuona is division not correct? Do you know what the correct taxonomic designation would be? Or will we just have to remove families we know are monocots?

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan Eudicots should be a group available -- is it? I could not tell from the code ... (I think Tracheophyta is just vascular plants ...). Jonathan said the GBIF part of taxize is the most user friendly, though I see eudicots and monocots are 'unranked' on GBIF's website... but I also found this code in the PDF for the taxize package:

get_uid(sci_com = "Echinacea", division_filter = "eudicots")

So I would try to get to there ....

DeirdreLoughnan commented 2 years ago

Thanks @lizzieinvancouver!

That is a good point, I was coming at it by classifying then subsetting, but that filter function would work just as well!

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver I played around with that code today, but was disappointed by the output, an example of which is here: [screenshot: Screen Shot 2022-07-30 at 5 59 39 PM]. I was hoping it would have a column of T/F for whether each genus was a monocot or dicot, but that is not what either multiple_matches or pattern_match represents. I will try to think of another way forward.

lizzieinvancouver commented 2 years ago

@dbuona Could you spare an hour to see where you get on this? If you can't figure it by then I can try.

dbuona commented 2 years ago

Yes. I’ll give it a shot in the first half of this week, if that is okay for the general timeline.

dbuona commented 2 years ago

I ran the code in subsettingMonocots.R, and I do not think I am following exactly what it is doing, but it seems like there is an output file of dicots and one of monocots? At a glance, they look correct to me, so I think I am unclear about the exact task at hand.

There are some monocot orders in the data frame eudicot, but it seems like those disappear when you get to the eudicotFiltered data object... Anyway, if you can give me more specific instructions about where help is needed, I am happy to poke at it some more.

lizzieinvancouver commented 2 years ago

@dbuona Thanks! The goal is to end up with just the eudicots so we can match them to which studies to extract ... @DeirdreLoughnan seemed to think it wasn't working, but maybe you think it is?

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver @dbuona I think the issue is that while things are generally correct they are not completely correct.

The two datasets eudicot (line 121) and monocot (122) are made based on the list of monocots (line 119) that I know/could find with a quick google. So this is far from robust.

I then tried to use the "division_filter" argument in get_uid (lines 126-129). This generates a list from which I try to extract a dataset I called "eudicotFiltered", but no genera are being removed, and the columns only tell you whether there was a match in the database and whether there were multiple matches.

So essentially, I don't think this function is at all useful. But that still leaves us without a reproducible way of sorting species.

@dbuona do you have a good tree of plant evolution? Perhaps we could just extract the genus names on the eudicot branch, separately from those on the monocot branch, and then sort our datasets that way?

lizzieinvancouver commented 2 years ago

@dbuona Any thoughts on this?

dbuona commented 2 years ago

Yes! Deirdre and I have been chatting about this off git. I think we should just filter out monocots at the level of order. There are 11 monocot orders, which I think should be relatively easy to grab using Taxonstand. I have Wednesday morning of this week slotted to work on this and will report back then.
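A minimal sketch of what order-level filtering could look like, assuming each species already has an order assigned. The `species` table and its column names are made up for illustration and are not the objects in subsettingMonocots.R; the 11 orders listed are the monocot orders recognized in APG IV. Note that whatever remains after removing monocots is "non-monocots", which may still include groups other than eudicots (for example magnoliids or gymnosperms).

```r
# The 11 monocot orders recognized in APG IV
monocot_orders <- c("Acorales", "Alismatales", "Petrosaviales", "Dioscoreales",
                    "Pandanales", "Liliales", "Asparagales", "Arecales",
                    "Poales", "Commelinales", "Zingiberales")

# Hypothetical toy table; in practice the order column would come from a
# taxonomic lookup rather than being typed by hand
species <- data.frame(
  genus = c("Echinacea", "Quercus", "Zea", "Allium", "Acer"),
  order = c("Asterales", "Fagales", "Poales", "Asparagales", "Sapindales"),
  stringsAsFactors = FALSE
)

# Flag each row by whether its order is one of the monocot orders
is_monocot <- species$order %in% monocot_orders

# Split into the rows to drop and the rows to keep
monocots     <- species[is_monocot, ]
non_monocots <- species[!is_monocot, ]

print(non_monocots$genus)
```

Matching on order rather than on a hand-typed list of genera keeps the rule short and easy to audit, at the cost of depending on the order assignments being correct.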

lizzieinvancouver commented 2 years ago

@dbuona Sweet! That sounds great.

dbuona commented 2 years ago

Okay, I added a taxonomic filter for monocots at the order level, lines 93-102 in subsettingMonocots.R. It did produce a narrower list than previously, so I think that's good. I am not sure I integrated this list properly. I don't understand what lines 102+ in the code are really doing (the lapplys and do.calls etc.), so maybe I will kick this back to @DeirdreLoughnan to put the finishing touches on the subset based on the "eudicot2" list I created in line 100? I tried to add that into the function in line 110, but all that appeared to do was break it. I am also happy to zoom if it is easiest to work through this together.

lizzieinvancouver commented 2 years ago

@dbuona Thanks! Fewer species sounds good to me. @DeirdreLoughnan Let us know how it goes. We should probably pick a couple of species that we check are excluded and a couple that we check are included.
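One way such a spot check could be written, as a rough sketch only: the helper `check_filter` and the genus lists below are hypothetical and would need to be swapped for the real filtered list and for species we actually care about.

```r
# Hypothetical helper: compare the genera kept by the filter against genera
# we expect to be kept and genera we expect to be removed
check_filter <- function(kept, must_include, must_exclude) {
  list(
    missing = setdiff(must_include, kept),    # expected to be kept, but absent
    leaked  = intersect(must_exclude, kept)   # expected to be removed, but present
  )
}

# Made-up example: genera remaining after the monocot filter
kept_genera <- c("Quercus", "Acer", "Echinacea")

result <- check_filter(
  kept_genera,
  must_include = c("Quercus", "Echinacea"),  # known non-monocots
  must_exclude = c("Zea", "Allium")          # known monocots
)

# Stop with an error if either check fails
stopifnot(length(result$missing) == 0, length(result$leaked) == 0)
```

Keeping the check as a `stopifnot` call means the script fails loudly if a later change to the filter lets a monocot back in or drops a species that should stay.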

DeirdreLoughnan commented 2 years ago

Thanks @dbuona, I went through the subsetting code. Your code looks good to me, and we no longer need the code below line 102.

Ultimately, this subsampling reduces the number of studies to 426. I have pushed a new oegres.xlsx file and will tell everyone to update their respective source tabs on Monday when we meet.

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan 426? Whoo-hoo! This is great news.

We can probably close this too.