Open Ewanwong opened 1 year ago
Ah, you may need to tweak the code a bit if Emerald has changed the webpage template (I don't see any big changes in its appearance). If you just need our data dump for research purposes, you can send us an email (memray0@gmail.com).
Looking into this, looks like the regex is not capturing things correctly:
jpage_entries = re.findall(
    r'<a href="/insight/publication.*class="intent_tocIssueLink".*</a>', jpage.text)
When I try the regex manually, it does find the pattern...
Still trying to figure this out.
Changing to the code below seems to work:
regex = r'<a href="/insight/publication.*class="intent_tocIssueLink".*</a>'
text = jpage.text.replace("\n", "\\n")
jpage_entries = re.findall(regex, text, re.MULTILINE)
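A possible explanation, offered as a guess rather than a confirmed diagnosis: by default `.` does not match a newline, so if each issue link spans several lines in the page source, the original pattern cannot match. `re.MULTILINE` only changes how `^` and `$` behave, so in the workaround above it is the `replace` call that makes the difference. Below is a minimal sketch of an alternative that uses `re.DOTALL` with non-greedy quantifiers. The `sample_html` string and its URLs are made up for illustration, and the real Emerald markup may differ.

```python
import re

# Hypothetical sample markup; the real Emerald page may look different.
sample_html = (
    '<ul>\n'
    '<a href="/insight/publication/issn/0001-0001/vol/1/iss/1"\n'
    '   class="intent_tocIssueLink">\n'
    '   Volume 1 Issue 1\n'
    '</a>\n'
    '<a href="/insight/publication/issn/0001-0001/vol/1/iss/2"\n'
    '   class="intent_tocIssueLink">\n'
    '   Volume 1 Issue 2\n'
    '</a>\n'
    '</ul>\n'
)

# Pattern from download.py as quoted in this thread.
ORIGINAL = r'<a href="/insight/publication.*class="intent_tocIssueLink".*</a>'

# Without DOTALL, "." stops at each newline, so nothing matches here.
original_matches = re.findall(ORIGINAL, sample_html)

# MULTILINE only affects "^" and "$", so it does not help on its own.
multiline_matches = re.findall(ORIGINAL, sample_html, re.MULTILINE)

# The workaround from this thread: with the line breaks replaced, the greedy
# ".*" runs from the first link to the last "</a>", giving one large match.
flattened = sample_html.replace("\n", "\\n")
workaround_matches = re.findall(ORIGINAL, flattened, re.MULTILINE)

# Alternative sketch: stay inside the opening tag with [^>]*, then let a
# non-greedy ".*?" (with DOTALL) stop at the first closing "</a>".
pattern = re.compile(
    r'<a href="/insight/publication[^>]*?class="intent_tocIssueLink"[^>]*>.*?</a>',
    re.DOTALL,
)
jpage_entries = pattern.findall(sample_html)

print(len(original_matches), len(multiline_matches),
      len(workaround_matches), len(jpage_entries))
```

On this made-up sample, the non-greedy version returns one entry per issue link, while the flattened-text workaround returns a single match covering both links. Whether that matters depends on how the rest of download.py processes `jpage_entries`.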
Hi, I tried to crawl the dataset with your code, but nothing was downloaded. I inspected the process and found that no 'jpage_entries' are matched at line 55 of download.py. Besides, I have only 517 items in journal_url_list. Is this the expected number? Thanks