ropensci / webchem

Chemical Information from the Web
https://docs.ropensci.org/webchem

Get_cid pulls many CIDs, some wrong #417

Open daithi45 opened 7 months ago

daithi45 commented 7 months ago

Hi all, I'm running a dataset of ~1000 CAS numbers through webchem to pull CIDs. For about half of them, get_cid() returns more than one CID. For example:

get_cid("613-33-2", from="cas", match = "all")
# A tibble: 2 × 2
  query    cid   
  <chr>    <chr> 
1 613-33-2 11941 
2 613-33-2 170889

Most of the time, the first CID returned isn't the correct one, so each result needs manual checking. Is there any way to improve my approach and reduce the manual work?
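One way to shrink the manual step, whatever the cause of the extra hits, is to separate queries that resolve to a single CID from those that do not, so only the latter need checking. The sketch below is a minimal illustration in base R. `triage_cids()` is a hypothetical helper, not part of webchem, and it assumes the input has the `query` and `cid` columns shown in the output above. The second query in the example data is a made-up placeholder.

```r
# Hypothetical helper (not part of webchem): split get_cid()-style results
# into queries with exactly one CID and queries that need manual review.
triage_cids <- function(res) {
  stopifnot(all(c("query", "cid") %in% names(res)))
  # Drop rows where no CID was found, then remove duplicated query/CID pairs.
  res <- unique(as.data.frame(res)[!is.na(res$cid), c("query", "cid")])
  # Count the distinct CIDs left for each query.
  n_hits <- table(res$query)
  single <- names(n_hits)[n_hits == 1]
  list(
    resolved  = res[res$query %in% single, , drop = FALSE],
    ambiguous = res[!res$query %in% single, , drop = FALSE]
  )
}

# Example input: the first two rows copy the output above;
# the third row is a made-up placeholder, not a real CAS number or CID.
res <- data.frame(
  query = c("613-33-2", "613-33-2", "placeholder-query"),
  cid   = c("11941", "170889", "0"),
  stringsAsFactors = FALSE
)
out <- triage_cids(res)
out$ambiguous  # the rows that still need a manual look
```

This does not decide which CID is correct. It only narrows the review to the queries where the lookup returned more than one candidate.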

Aariq commented 7 months ago

At first, I thought "this is just the nature of CAS numbers" or "this is just how searching for CAS numbers on PubChem works". But in this example, if I search for 613-33-2 on PubChem, I only get one result. @stitam, it might be worth double-checking how we are querying the PubChem API here, and whether there is an alternative query that returns only the best match according to PubChem ("best" is currently not an option for the match argument of get_cid()).

daithi45 commented 7 months ago

Interestingly, if I take out the from="cas" argument, I only get 1 CID back. I will try this on my main dataset and see if it works!

get_cid("613-33-2", match = "all")
# A tibble: 1 × 2
  query    cid  
  <chr>    <chr>
1 613-33-2 11941
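
Since the two lookups can disagree, one possible check is to keep only the CIDs that both of them return for the same query. The sketch below is a minimal base R illustration using the two results from this thread. `consensus_cid()` is a hypothetical helper, not part of webchem, and agreement between the two lookups is only a heuristic, not proof that the CID is correct.

```r
# Hypothetical helper (not part of webchem): keep only the query/CID pairs
# that appear in both result tables.
consensus_cid <- function(by_cas, by_name) {
  keys <- c("query", "cid")
  merge(
    unique(as.data.frame(by_cas)[, keys]),
    unique(as.data.frame(by_name)[, keys]),
    by = keys
  )
}

# Results copied from the two outputs in this thread.
# With webchem loaded they would come from:
#   by_cas  <- get_cid("613-33-2", from = "cas", match = "all")
#   by_name <- get_cid("613-33-2", match = "all")
by_cas <- data.frame(
  query = c("613-33-2", "613-33-2"),
  cid   = c("11941", "170889"),
  stringsAsFactors = FALSE
)
by_name <- data.frame(
  query = "613-33-2",
  cid   = "11941",
  stringsAsFactors = FALSE
)

agreed <- consensus_cid(by_cas, by_name)
agreed
```

For 613-33-2 this leaves only CID 11941. Queries with no row in the merged result are the ones where the two lookups share no CID, so those would still need a manual check.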