scholarly-python-package / scholarly

Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
https://scholarly.readthedocs.io/
The Unlicense

Fixing skipped tests #413

Open ipeirotis opened 2 years ago

ipeirotis commented 2 years ago

Describe the bug
We have a few issues with the tests, causing the builds to fail.

To Reproduce
Taking a look at the build logs, we see the following errors:

Ubuntu

======================================================================
FAIL: test_download_mandates_csv (test_module.TestScholarly)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/scholarly/scholarly/test_module.py", line 668, in test_download_mandates_csv
    self.assertEqual(policy[agency_index], agency_policy[agency])
AssertionError: '82%' != ''
- 82%
+ 

======================================================================
FAIL: test_related_articles_from_author (test_module.TestScholarly)
Test that we obtain related articles to an article from an author
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/scholarly/scholarly/test_module.py", line 578, in test_related_articles_from_author
    self.assertEqual(pub[key], same_article[key])
AssertionError: 71856 != 71882

----------------------------------------------------------------------

Mac

FAIL: test_download_mandates_csv (test_module.TestScholarly)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/runner/work/scholarly/scholarly/test_module.py", line 668, in test_download_mandates_csv
    self.assertEqual(policy[agency_index], agency_policy[agency])
AssertionError: '82%' != ''
- 82%
+

======================================================================
FAIL: test_related_articles_from_author (test_module.TestScholarly)
Test that we obtain related articles to an article from an author
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/runner/work/scholarly/scholarly/test_module.py", line 578, in test_related_articles_from_author
    self.assertEqual(pub[key], same_article[key])
AssertionError: 71856 != 71882

======================================================================
FAIL: test_related_articles_from_publication (test_module.TestScholarly)
Test that we obtain related articles to an article from a search
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/runner/work/scholarly/scholarly/test_module.py", line 605, in test_related_articles_from_publication
    self.assertEqual(related_article['bib']['title'], 'Large Magellanic Cloud Cepheid standards provide '
AssertionError: 'Planck 2015 results-xiii. cosmological parameters' != 'Large Magellanic Cloud Cepheid standards [110 chars]ΛCDM'
- Planck 2015 results-xiii. cosmological parameters
+ Large Magellanic Cloud Cepheid standards provide a 1% foundation for the determination of the Hubble constant and stronger evidence for physics beyond ΛCDM

Windows

======================================================================
ERROR: test_download_mandates_csv (test_module.TestScholarly)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\scholarly\scholarly\test_module.py", line 667, in test_download_mandates_csv
    agency_index = funder.index(agency)
ValueError: 'US National Science Foundation' is not in list

======================================================================
ERROR: test_search_author_single_author (test_module.TestScholarly)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\scholarly\scholarly\test_module.py", line 305, in test_search_author_single_author
    scholarly.pprint(author)
  File "D:\a\scholarly\scholarly\scholarly\_scholarly.py", line 437, in pprint
    print(pprint.pformat(to_print))
  File "c:\hostedtoolcache\windows\python\3.8.10\x64\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u015f' in position 6681: character maps to <undefined>

======================================================================
FAIL: test_related_articles_from_author (test_module.TestScholarly)
Test that we obtain related articles to an article from an author
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\scholarly\scholarly\test_module.py", line 578, in test_related_articles_from_author
    self.assertEqual(pub[key], same_article[key])
AssertionError: 71856 != 71882

----------------------------------------------------------------------

It is a bit unclear what causes these issues, especially the ones associated with the mandates calls.
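A side note on the Windows UnicodeEncodeError above: it is an output-encoding mismatch rather than a scraping problem. A minimal sketch (the sample string is hypothetical; the UTF-8 workaround is a common fix, not necessarily what the project adopted):

```python
import sys

# The runner's stdout on Windows defaults to cp1252, which cannot represent
# '\u015f' (ş, common in Turkish author names), hence the UnicodeEncodeError.
sample = "Ay\u015fe"  # hypothetical author name containing ş
try:
    sample.encode("cp1252")
except UnicodeEncodeError as exc:
    print(f"cp1252 cannot encode {exc.object[exc.start]!r}")

# One common workaround: switch stdout to UTF-8 before printing
# (TextIOWrapper.reconfigure is available on Python 3.7+).
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")
print(sample)
```

Setting the `PYTHONIOENCODING=utf-8` environment variable in the workflow achieves the same effect without code changes.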

arunkannawadi commented 2 years ago

I've seen these and was never able to reproduce them locally. They seem specific to Github Actions, which makes them extremely difficult to find and fix.

arunkannawadi commented 2 years ago

#424 fixes all of the actual bugs reported here. Two unsolved issues remain:

  1. On Windows runners, writing and reading files appears to be problematic, which breaks download_mandates_csv. For now, we resolve this by skipping the test on Windows only.

  2. The paper title mismatches are likely due to stale caches on the runner machines. There's no way for us to force cache clearing, and Github automatically evicts caches after 7 days of disuse. https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows#usage-limits-and-eviction-policy. I suggest we reduce the cron jobs to once a week so that they are not impacted by this caching.
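The Windows-only skip from point 1 can be sketched with unittest's skipIf decorator (the class layout and skip message here are illustrative assumptions about how the test module is organized):

```python
import sys
import unittest

class TestScholarly(unittest.TestCase):
    # Skip only on Windows runners, where file I/O in this test is unreliable;
    # the test still runs on the Ubuntu and Mac runners.
    @unittest.skipIf(sys.platform.startswith("win"),
                     "File I/O for the mandates CSV is flaky on Windows runners")
    def test_download_mandates_csv(self):
        # ... the real test downloads the mandates CSV and checks agency policies
        pass
```

A skipped test is reported as such in the run summary rather than as a pass, so the gap stays visible in the build logs.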

arunkannawadi commented 2 years ago

Repurposing this issue to fix the tests we currently skip so we don't have to skip them.

arunkannawadi commented 2 years ago

The tests that passed during the merge are failing again. I propose we use CircleCI in addition to Github Actions for now, to see if we can reproduce the failures and potentially debug them over SSH, which is not possible in Github Actions. https://circleci.com/docs2/2.0/ssh-access-jobs

ipeirotis commented 2 years ago

ScraperAPI increased the price to $50 per month and I canceled my subscription. Perhaps this is the reason for the failures.

arunkannawadi commented 2 years ago

No, I'm pretty sure that's not the reason. The tests are failing because the fetched values are not what they are expected to be. This is restricted to GHAs alone, since I can't reproduce it locally. But it's good to know that the tests are otherwise robust with FreeProxies.

ipeirotis commented 2 years ago

I think the ScraperAPI key still works, but only for a limited number of queries.

Btw, perhaps the issue appears due to proxies caching the requests for longer than they should?

arunkannawadi commented 2 years ago

ScraperAPI increased the price to $50 per month and I canceled my subscription.

To make matters worse, ScraperAPI charges 25 credits per request for pages in the Google domain. https://www.scraperapi.com/documentation/#curl-CreditsAndRequests

This means we can only make 40 Google Scholar requests per month on the free plan.
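The arithmetic behind the 40-request figure (the 1,000-credit monthly allowance is inferred from that figure, not stated above):

```python
CREDITS_PER_GOOGLE_REQUEST = 25    # ScraperAPI's rate for Google-domain pages
FREE_PLAN_MONTHLY_CREDITS = 1000   # assumed free-plan allowance, implied by the 40-request figure

requests_per_month = FREE_PLAN_MONTHLY_CREDITS // CREDITS_PER_GOOGLE_REQUEST
print(requests_per_month)  # → 40
```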