spinlud / py-linkedin-jobs-scraper

MIT License

Scraper is not able to retrieve more than 60 results per query #4

Closed NLCas8 closed 3 years ago

NLCas8 commented 3 years ago

Hi,

First of all, thank you for this great tool @spinlud!

When I run a query the scraper never gets past page 9 (about 50-60 individual results) somehow, see below:

INFO:li:scraper:('[Data analist][Nederland][56]', 'Pagination requested (9)')
INFO:li:scraper:('[Data analist][Nederland][56]', "Couldn't find more jobs for the running query")

If I were to run the exact query using LinkedIn itself it usually finds more than 5000 results. I have tried changing the parameters, like setting TIME=ANY, LIMIT=10000, and different combinations, but with no luck yet.

Is this a bug, a limitation of the LinkedIn API, or perhaps am I doing something wrong?

Thank you for your help!

spinlud commented 3 years ago

Hi there! Can you share the code of the query?

NLCas8 commented 3 years ago

This is one of the queries I was trying:

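from linkedin_jobs_scraper.query import Query, QueryOptions, QueryFilters
from linkedin_jobs_scraper.filters import (
    RelevanceFilters, TimeFilters, TypeFilters, ExperienceLevelFilters,
)

query = Query(
    query='Data analist',
    options=QueryOptions(
        locations=['Nederland'],
        optimize=True,
        limit=10000,
        filters=QueryFilters(
            company_jobs_url=None,
            relevance=RelevanceFilters.RECENT,
            time=TimeFilters.ANY,
            type=TypeFilters.FULL_TIME,
            experience=ExperienceLevelFilters.MID_SENIOR,
        )
    )
)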

It always seems to stop after 56 results. If I run the example query, interestingly it does retrieve more results than 56.

spinlud commented 3 years ago

With those settings I got 186 jobs:

[Screenshot: 2021-01-31 at 16.51.30]

Can you try again with all filters removed?

NLCas8 commented 3 years ago

I retried with the following query:

query = Query(
    query='Data analist',
    options=QueryOptions(
        locations=['Nederland'],
        optimize=True,
        limit=10000,
    )
)


Final output:

INFO:li:scraper:('[Data analist][Nederland][55]', 'Processed')
INFO:li:scraper:('[Data analist][Nederland][56]', 'Processed')
INFO:li:scraper:('[Data analist][Nederland][56]', 'Pagination requested (9)')
INFO:li:scraper:('[Data analist][Nederland][56]', "Couldn't find more jobs for the running query")

I am not sure what is causing this.
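For anyone comparing runs, the log lines above can be summarized mechanically. The following is a minimal sketch that assumes the log format is exactly what is pasted in this thread (`INFO:li:scraper:` followed by a Python tuple literal); the `summarize` helper and its field names are hypothetical and not part of the library.

```python
import ast
import re

# Matches lines such as:
# INFO:li:scraper:('[Data analist][Nederland][56]', 'Processed')
LOG_LINE = re.compile(r"^INFO:li:scraper:(\(.*\))$")
# Matches the tag '[query][location][count]'; the count is optional.
TAG = re.compile(r"^\[[^\]]*\]\[[^\]]*\](?:\[(?P<count>\d+)\])?$")
PAGE = re.compile(r"^Pagination requested \((\d+)\)$")


def summarize(log_text):
    """Return the highest processed count, the last requested page,
    and whether the scraper reported that it ran out of jobs."""
    summary = {"processed": 0, "last_page": None, "exhausted": False}
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if not match:
            continue
        try:
            # The payload after the prefix is a Python tuple literal.
            fields = ast.literal_eval(match.group(1))
        except (ValueError, SyntaxError):
            continue
        if not isinstance(fields, tuple) or len(fields) < 2:
            continue
        tag, message = fields[0], fields[1]
        tag_match = TAG.match(tag)
        if tag_match and tag_match.group("count") and message == "Processed":
            summary["processed"] = max(summary["processed"], int(tag_match.group("count")))
        page_match = PAGE.match(message)
        if page_match:
            summary["last_page"] = int(page_match.group(1))
        if message.startswith("Couldn't find more jobs"):
            summary["exhausted"] = True
    return summary


sample = """\
INFO:li:scraper:('[Data analist][Nederland][55]', 'Processed')
INFO:li:scraper:('[Data analist][Nederland][56]', 'Processed')
INFO:li:scraper:('[Data analist][Nederland][56]', 'Pagination requested (9)')
INFO:li:scraper:('[Data analist][Nederland][56]', "Couldn't find more jobs for the running query")
"""
print(summarize(sample))
```

For the output in this comment, the summary reports 56 processed jobs, page 9 as the last page requested, and an exhausted query, which matches the symptom described.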

spinlud commented 3 years ago

It seems to me there are problems with scrolling/loading more jobs on the LinkedIn website when using an anonymous (logged-out) session. I tried opening this URL in Chrome, in both incognito and normal mode, on both Mac and Windows, and got the same result: at some point while scrolling through the jobs, pagination stops working.


This is normal browser navigation; it has nothing to do with this library. To double-check that it is not an IP-related issue, I also tried connecting through my smartphone hotspot (to get a different IP), but hit the same problem. Honestly, I don't think there is anything I can do here; it looks like a LinkedIn issue to me. What you can try is using an authenticated session, as described here, and see if it helps!

NLCas8 commented 3 years ago

Opening the URL in incognito mode, I do see the same thing: pagination suddenly stops working. If I open the URL while logged in, I can scroll down and click through to the next pages.

Actually, I was already trying with an authenticated session, so that did not make a difference unfortunately. It seems as if it behaves like it was an anonymous session somehow, even though it does say it is using the AuthenticatedStrategy:

INFO:li:scraper:('Using strategy AuthenticatedStrategy',)
INFO:li:scraper:('[Data scientist][Netherlands]', 'Setting authentication cookie')
INFO:li:scraper:('[Data scientist][Netherlands]', 'Session is valid')

spinlud commented 3 years ago

I found a possible bug with pagination when using an authenticated session. Could you retry with the latest version and see if it helps?

NLCas8 commented 3 years ago

It's fixed now, you're an absolute legend! :D

spinlud commented 3 years ago

FYI @NLCas8
It seems that the pagination issues in anonymous mode are caused by LinkedIn not being compliant with Chrome's CSP (Content Security Policy). Honestly, I don't know when (or if) they are going to fix this, but I've found a workaround that forces a CSP bypass using the Chrome DevTools Protocol. Long story short, pagination should work properly again, even in anonymous mode 🎵

PS: you can try it yourself by enabling this Chrome extension on the LinkedIn page
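For readers wondering what a Content Security Policy bypass through the DevTools Protocol can look like, here is a minimal sketch. The thread does not say which command the library actually sends, so this is an assumption: it uses the DevTools `Page.setBypassCSP` command through Selenium's `execute_cdp_cmd` method on a Chrome driver. The `enable_csp_bypass` helper and the `RecordingDriver` stand-in are hypothetical and exist only to show the shape of the call without launching a browser.

```python
def enable_csp_bypass(driver):
    """Ask Chrome to ignore Content Security Policy for the driver's page.

    `driver` is expected to behave like selenium.webdriver.Chrome, which
    exposes execute_cdp_cmd(cmd, cmd_args) for DevTools Protocol commands.
    """
    # Assumption: Page.setBypassCSP is the relevant command; the library
    # may implement its workaround differently.
    return driver.execute_cdp_cmd("Page.setBypassCSP", {"enabled": True})


class RecordingDriver:
    """Hypothetical stand-in for a Chrome driver; it only records commands."""

    def __init__(self):
        self.commands = []

    def execute_cdp_cmd(self, cmd, cmd_args):
        self.commands.append((cmd, cmd_args))
        return {}


driver = RecordingDriver()
enable_csp_bypass(driver)
print(driver.commands)
```

With a real Selenium Chrome driver, the same helper would presumably be called before navigating to the jobs page, so that the bypass is already active when the page's scripts load.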

NLCas8 commented 3 years ago

I just gave it a shot in anonymous mode and it's working indeed, it's retrieving over 50 results now. Pretty neat!

Also, if other people run into this thread, I found that setting slow_mo to 5 would give me the best experience. Any faster and it will give you the 'Too many requests' error at some point. 5 seems to be perfect, where you can keep it running as long as you wish without errors :)
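A minimal sketch of the `slow_mo` setting mentioned above, assuming the `LinkedinScraper` constructor accepts `slow_mo` and `max_workers` keyword arguments as shown in the project README. The `build_scraper` helper is hypothetical, and the value 5 is just what worked in this thread, not a documented limit.

```python
def build_scraper(slow_mo=5, max_workers=1):
    """Create a scraper that waits longer between requests.

    A larger slow_mo slows the scraper down, which in this thread was
    enough to avoid the 'Too many requests' error.
    """
    # Imported lazily so this sketch can be loaded without the package installed.
    from linkedin_jobs_scraper import LinkedinScraper

    return LinkedinScraper(max_workers=max_workers, slow_mo=slow_mo)
```

Calling `build_scraper()` and then passing a query to the scraper's `run` method should behave like the setup described in this comment; lowering `slow_mo` trades stability for speed.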