Open tarinidash opened 3 years ago
Unfortunately it looks like the RSC "articlelanding" page is assembled dynamically using JavaScript, as you scroll down the page. So the HTML that you save may not include the full article, even though it appears to in the browser. It might work better if you click the "Article HTML" button on the right and save that page instead: https://pubs.rsc.org/en/content/articlehtml/2015/tc/c5tc02626a
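A quick way to check whether a saved page actually contains the article body, rather than a JavaScript shell, is to count the visible words in the file. The sketch below uses only the Python standard library; the names `visible_word_count` and `looks_complete` and the 2000-word threshold are assumptions for illustration, not part of ChemDataExtractor.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_word_count(html_text):
    """Return the number of whitespace-separated words of visible text."""
    parser = _TextCollector()
    parser.feed(html_text)
    parser.close()
    return sum(len(chunk.split()) for chunk in parser.chunks)


def looks_complete(html_text, min_words=2000):
    """Heuristic only: min_words is an arbitrary, assumed threshold."""
    return visible_word_count(html_text) >= min_words
```

If the count for a saved "articlelanding" page is far below what the article shows in the browser, the body was probably loaded dynamically and is missing from the file.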
The demo results on the web site were run quite a few years ago now, so unfortunately the article HTML may also have changed since. The full web site code is available at: https://github.com/mcs07/cdeweb
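It only does a couple of extra things to extend the output, all in the get_biblio (https://github.com/mcs07/cdeweb/blob/master/cdeweb/tasks.py#L50) and add_structures (https://github.com/mcs07/cdeweb/blob/master/cdeweb/tasks.py#L62) functions.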
Hey Matt, thank you for your response. I did click on "Article HTML", which took me to https://pubs.rsc.org/en/content/articlehtml/2015/tc/c5tc02626a. I then saved this HTML file locally using File > Save As in the Chrome browser.
I will go through the rest of your code and recommendations.
Thanks, Tarini Dash
On Wed, Sep 1, 2021 at 11:30 AM Matt Swain wrote: [quoted reply trimmed; identical to the comment above]
I have been working with this library to extract chemical information from HTML pages. I followed http://chemdataextractor.org/demo and saved https://pubs.rsc.org/en/content/articlelanding/2015/TC/C5TC02626A as an HTML file (input3.html).
Below is my code.
from chemdataextractor import Document

with open('input/input3.html', 'rb') as f:
    doc = Document.from_file(f)
records = doc.records.serialize()
This does not match the records in the JSON output published at https://pubs.rsc.org/en/content/articlelanding/2015/TC/C5TC02626A . A lot of information is missing, including smiles, fluorescence_lifetimes, etc.
@mcs07, I was wondering if you could publish the code that was used for the demo.
PS: Is there a method that creates the entire JSON, including abbreviations + biblio + records, or are they extracted separately and stitched together to create the final JSON output?
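For illustration, if the three parts are extracted separately, stitching them into one JSON document could look roughly like the sketch below. `build_output` and the key names `biblio`, `abbreviations`, and `records` are hypothetical, chosen for this example; this is not the demo's actual code.

```python
import json


def build_output(biblio, abbreviations, records):
    """Combine separately extracted parts into one JSON-serializable dict.

    The key names are assumptions made for this sketch.
    """
    return {
        "biblio": dict(biblio),
        # Tuples become lists so the result survives a JSON round trip.
        "abbreviations": [list(a) for a in abbreviations],
        "records": list(records),
    }


if __name__ == "__main__":
    combined = build_output(
        biblio={"title": "Example title"},
        abbreviations=[(["THF"], ["tetrahydrofuran"], "CM")],
        records=[{"names": ["THF"]}],
    )
    print(json.dumps(combined, indent=2))
```

With ChemDataExtractor, the records would presumably come from `doc.records.serialize()` and the abbreviations from `doc.abbreviation_definitions`; the bibliographic data would need to be scraped separately.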