hfthair / emerald_crawler


no "jpage_entries" can be matched in the download_journal function #9

Open Ewanwong opened 1 year ago

Ewanwong commented 1 year ago

Hi, I tried to crawl the dataset with your code, but ended up with nothing downloaded. I inspected the process and found that no 'jpage_entries' are matched at line 55 of download.py. Also, I have only 517 items in journal_url_list. Is that the expected number? Thanks

memray commented 1 year ago

Ah, you may need to tweak the code a bit if Emerald has changed the webpage template (I don't see big changes in its appearance). If you just need our data dump for research purposes, you can shoot us an email (memray0@gmail.com).

jpramos123 commented 1 year ago

Looking into this, it looks like the regex is not capturing things correctly:

    jpage_entries = re.findall(
        r'<a href="/insight/publication.*class="intent_tocIssueLink".*</a>', jpage.text)

When I try the regex manually, it seems to work and finds the pattern...

Trying to figure this out

jpramos123 commented 1 year ago

Changing it to the code below seems to work:

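    import re

    regex = r'<a href="/insight/publication.*class="intent_tocIssueLink".*</a>'
    # Swap real newlines for a literal "\n" so "." can match across the original line breaks.
    text = jpage.text.replace("\n", "\\n")
    jpage_entries = re.findall(regex, text, re.MULTILINE)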
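
For anyone hitting the same problem, here is a minimal sketch of why the newline replacement helps and of one possible alternative. It assumes the cause is that each issue link now spans more than one line in the page source, which the thread does not confirm. The `sample_html` fragment and the `pattern` regex are hypothetical illustrations, not the actual Emerald markup or the repository's code. By default `.` does not match a newline, and `re.MULTILINE` only changes how `^` and `$` behave, so it is the `replace` call that makes the difference in the workaround above.

```python
import re

# Hypothetical fragment: the real Emerald markup may differ.
sample_html = '''<ul>
<li><a href="/insight/publication/issn/0000-0001/vol/1/iss/1"
       class="intent_tocIssueLink">Issue 1</a></li>
<li><a href="/insight/publication/issn/0000-0001/vol/1/iss/2"
       class="intent_tocIssueLink">Issue 2</a></li>
</ul>'''

original = r'<a href="/insight/publication.*class="intent_tocIssueLink".*</a>'

# "." stops at the newline inside each tag, so the original call finds nothing.
no_matches = re.findall(original, sample_html)

# The workaround from this thread: with newlines removed, the greedy ".*"
# can run from the first link to the last "</a>" in a single match.
merged = re.findall(original, sample_html.replace("\n", "\\n"), re.MULTILINE)

# Assumed alternative: negated character classes and a non-greedy ".*?"
# with re.DOTALL, so that each anchor tag becomes its own match.
pattern = re.compile(
    r'<a href="/insight/publication[^"]*"[^>]*class="intent_tocIssueLink"[^>]*>.*?</a>',
    re.DOTALL,
)
entries = pattern.findall(sample_html)

print(len(no_matches), len(merged), len(entries))
```

On this sample the workaround returns one long match covering both links, while the non-greedy pattern returns the two links separately. Whether that matters depends on how download.py processes each entry afterwards, so check it against a real journal page before relying on it.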