tdurieux / leboncoin-api

DEPRECATED
https://www.npmjs.com/package/leboncoin-api

API not returning results #53

Closed jspingau closed 4 years ago

jspingau commented 5 years ago

Hi,

Since yesterday the API does not return any results on previously working queries. When debugging from lib/search.js I see the content returned by the search API only contains content like:

{"url":"https://c.datado.me/captcha/?initialCid=AHrlqAAAAAMAfhnhHfj6eVIAMw89Zw==&hash=05B30BD9055986BD2EE8F5A199D973&t=fe"}

so it looks like some captcha validation is expected ...from an API call, weird.
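
For anyone debugging the same symptom, here is a minimal sketch of how a script could tell this captcha payload apart from a normal search result. `getCaptchaUrl` is a hypothetical helper, not part of leboncoin-api, and it assumes the challenge always arrives as a JSON object with a `url` field pointing at a `datado.me` host, as in the response above.

```javascript
// Hypothetical helper: not part of leboncoin-api.
// Returns the captcha URL if the body looks like a captcha challenge, else null.
function getCaptchaUrl(rawBody) {
  let parsed;
  try {
    parsed = JSON.parse(rawBody);
  } catch (err) {
    return null; // not JSON at all
  }
  if (parsed === null || typeof parsed !== 'object' || typeof parsed.url !== 'string') {
    return null;
  }
  let host;
  try {
    host = new URL(parsed.url).hostname;
  } catch (err) {
    return null; // the "url" field is not a valid absolute URL
  }
  // Assumption: challenges are served from datado.me, as in the response above.
  const isChallengeHost = host === 'datado.me' || host.endsWith('.datado.me');
  return isChallengeHost ? parsed.url : null;
}

const blocked = '{"url":"https://c.datado.me/captcha/?initialCid=abc&t=fe"}';
console.log(getCaptchaUrl(blocked) !== null); // true
console.log(getCaptchaUrl('{"ads":[]}'));     // null
```

Checking for this before parsing results lets a script report "blocked" explicitly instead of silently returning an empty list.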

tdurieux commented 5 years ago

Apparently, they put new security in place. It does not sound good for this library :/

alexismoreau commented 5 years ago

I use this library for the category 'ventes_immobilieres' and everything seems to be working fine

tdurieux commented 5 years ago

@jspingau are you running this on a server, and if so, from which provider? @alexismoreau same question

Thank you

jspingau commented 5 years ago

@tdurieux: I am, Scaleway.

alexismoreau commented 5 years ago

I use it both on my OVH VPS and from my local computer to check just now. No problem at all.

alexismoreau commented 5 years ago

Maybe you are blacklisted. How many requests do you make? Did you try on a local computer?

jspingau commented 5 years ago

yes, might be what happened. I'll check running on a local computer later today and keep everyone updated.

jspingau commented 5 years ago

ok, I could finally test sooner than expected.

I confirm it works locally so I was blacklisted.

This is weird because usage should not be that high: this is a script that creates rss feeds out of searches on lbc. Granted, the rss reader does fetch the subscriptions on a regular basis but I'm the only one using this, so volume must be a few hundred hits per day...

Maybe using a different API key will help. I cannot find the documentation of api.leboncoin.fr; does anyone know how to get an API key?

Otherwise, I guess I'll just set up a vanilla proxy on a separate VM and change it whenever I get blacklisted.

jspingau commented 5 years ago

closing the issue as it has nothing to do with the library. Thanks for your help!

tdurieux commented 5 years ago

Maybe using a different API key will help. I cannot find the documentation of api.leboncoin.fr; does anyone know how to get an API key?

It is impossible to get one: it is not a public API.

Scaleway lets you change your IP easily, which should help (unless it is Scaleway as a whole that is banned).

jspingau commented 5 years ago

Hi,

For what it's worth: I finally got a chance to set up a new host (changing the IP was a no-no in my case for various reasons). I can confirm that only my IP was banned, not Scaleway NL as a whole :-)

I'll keep updating this thread in case I get blacklisted again.

cheers,

nbusseneau commented 5 years ago

I am also getting similar responses.

Strangely, LBC does not seem to ban the IP, as requests still work when sent from a browser on the same computer. I am currently checking whether copying the request headers from the browser into the script works.

alexismoreau commented 5 years ago

how do you send the requests from the browser ?

nbusseneau commented 5 years ago

how do you send the requests from the browser ?

From the Developer Tools, under Network tab. You can craft custom requests from there.

Or, an easier option in our case: just go to LBC and start a search for something: results will appear. At this point, open Developer Tools under Network tab, then click on the Search button directly from the results page: you will see a new request to the API pop up in Network tab. I suspect this is actually how this lib was reverse engineered in the first place ;)

nbusseneau commented 5 years ago

I can confirm that adding the following three headers has solved the issue, although the fix might be temporary:

"User-Agent": "BOOYAKA",
"Accept-Language": "artiche",
"Accept-Encoding": "please do me all the needful",

As you can see, the actual values for these headers do not really matter, it seems just having them present is enough to fool the API.

I will make a PR with more acceptable values.

nbusseneau commented 5 years ago

Apparently the changes in #56 are not enough to fix the issue for @alexismoreau. Perhaps there is more to be discovered? Can we reverse engineer their API blocking strategy? :P

Can you try on your production environment by using actual headers from a browser as suggested in my comment above? See if there's something that works if you actually try to be treated as a regular browser. If it still does not work even with the exact same headers as a browser, is the response still the same or is it something else?

Note: when reproducing your browser's headers, remember to set Accept-Encoding to * or identity rather than your browser's values (gzip, deflate, etc.). Otherwise the API will send compressed responses. You could still decompress them afterwards, but that requires more work on the response-handling side than "just" editing headers.
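
If you do want to send your browser's Accept-Encoding values, the extra work mentioned in the note could look roughly like the sketch below. `decodeBody` is a hypothetical helper, not something the library provides; it assumes you already have the raw response as a Buffer and the value of the Content-Encoding response header. It only uses Node's built-in `zlib` module.

```javascript
const zlib = require('zlib');

// Hypothetical helper: decodes a response body according to the value of its
// Content-Encoding header (a string, or undefined when the header is absent).
function decodeBody(buffer, contentEncoding) {
  const encoding = (contentEncoding || 'identity').trim().toLowerCase();
  switch (encoding) {
    case 'gzip':
    case 'x-gzip':
      return zlib.gunzipSync(buffer).toString('utf8');
    case 'deflate':
      return zlib.inflateSync(buffer).toString('utf8');
    case 'br':
      return zlib.brotliDecompressSync(buffer).toString('utf8');
    case 'identity':
      return buffer.toString('utf8');
    default:
      throw new Error('Unsupported Content-Encoding: ' + encoding);
  }
}

// Round trip with a locally compressed payload (no network involved).
const payload = JSON.stringify({ total: 0, ads: [] });
console.log(decodeBody(zlib.gzipSync(Buffer.from(payload)), 'gzip') === payload); // true
```

The synchronous zlib calls keep the sketch short; for large responses a streaming approach would be a better fit.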

alexismoreau commented 5 years ago

Will try everything you suggested tonight or tomorrow. I'll keep you updated

alexismoreau commented 5 years ago

I tried with:

{
  method: 'POST',
  hostname: 'api.leboncoin.fr',
  port: null,
  path: '/finder/search',
  headers: {
    origin: 'https://www.leboncoin.fr',
    api_key: 'ba0c2dad52b3ec',
    'content-type': 'text/plain;charset=UTF-8',
    accept: '*/*',
    referer: 'https://www.leboncoin.fr/annonces/offres/ile_de_france/',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.8,fr;q=0.6',
    'Accept-Encoding': '*',
  },
}

The response is still:

{ url: 'https://c.datado.me/captcha/?initialCid=AHrlqAAAAAMAYdImbmdv7eAAM0sdRg==&hash=05B30BD9055986BD2EE8F5A199D973&t=fe' }

Maybe I'm really blacklisted on my production server. It works locally. We also need to add 'User-Agent', 'Accept-Language' & 'Accept-Encoding' to item.js

nbusseneau commented 5 years ago

We also need to add 'User-Agent', 'Accept-Language' & 'Accept-Encoding' to item.js

Nice catch, hadn't seen that. It seems there are two requests in item.js (one for details and the other for phone number), and fun fact, the second one already has two additional headers: https://github.com/tdurieux/leboncoin-api/blob/5280d9ae90e48f1fc9e7f94f4708dd8e9756512a/lib/item.js#L116-L117

Not sure why you would still get the error when sending requests from your server, unless you have a proxy that strips headers. I would like to know whether the change helps other blocked people. If it does, the solution works and something else is going on in your environment.

tdurieux commented 5 years ago

Hi all,

@alexismoreau your IP is probably blacklisted for the moment, you should answer the captcha several times and theoretically it should work again on your server.

@Skymirrh can you add the header for the requests that are missing it? Thanks a lot

jspingau commented 5 years ago

Hello everyone,

My new server got blacklisted again after a few days, so I went ahead and added:

"User-Agent": "Gecko",
"Accept-Language": "en-US,en;q=0.8,fr;q=0.6",
"Accept-Encoding": "*",

to the header sections of both search.js and item.js, and that did the trick. Thanks!

It looks like DataDome is pretty sensitive to "Accept-Encoding": "*". When I tried more realistic-looking values for Accept-Encoding (text/html, text/plain, application/json) I was denied every single time, yet any sort of junk, including application\json, works... go figure.

I'll continue monitoring and keep you updated during the next few days.

Cheers,


tdurieux commented 5 years ago

Looking at DataDome, it will be difficult to maintain a reliable service using Node.js, especially if the technique this library uses is public.

I am not sure that this library makes sense anymore.

nbusseneau commented 5 years ago

I am just discovering DataDome right now. Perhaps a solution would be to contact LBC and see if they can allow the use of a bot provided we respect their terms, e.g. throttling requests, using a specific user-agent, not allowing pages with more than the default size of 35.
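
To make the suggestion concrete, here is a rough sketch of what "respecting their terms" could mean on the client side: a fixed page-size cap and a minimum delay between requests. Everything except the default page size of 35 is an assumption for illustration; the function names are hypothetical and any actual limits would have to come from LBC.

```javascript
// All names and values here are hypothetical; the thread only gives the
// default page size of 35.
const MAX_PAGE_SIZE = 35;

// Never request more results per page than the site's default.
function clampPageSize(requested) {
  const n = Number.isFinite(requested) ? Math.floor(requested) : MAX_PAGE_SIZE;
  return Math.min(Math.max(n, 1), MAX_PAGE_SIZE);
}

// Returns how long (in ms) to wait before the next request so that two
// requests are never closer together than minIntervalMs.
function delayBeforeNext(lastRequestAt, now, minIntervalMs) {
  if (lastRequestAt === null) return 0; // first request: no wait
  return Math.max(0, lastRequestAt + minIntervalMs - now);
}

console.log(clampPageSize(100));                // 35
console.log(delayBeforeNext(1000, 4000, 5000)); // 2000
```

A caller would pass the result of `delayBeforeNext` to `setTimeout` before issuing the next request.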

apetitjean commented 5 years ago

In my opinion it doesn't really matter, given that scraping LBC is not permitted by the site's terms of use. Hence, if we don't hide that we are a bot, we will be blocked.

See below:

7.2 No User or Advertiser may copy, modify, create a derivative work of, reverse engineer or disassemble, or otherwise attempt to discover the source code of, sell, assign, sublicense or transfer in any way any right relating to the Elements.

Every User and Advertiser of the LEBONCOIN Service undertakes in particular not to:

use or query the LEBONCOIN Service, the Secure Payment Service and/or the Delivery Service on behalf of or for the benefit of others;

extract, for commercial purposes or not, all or part of the information or classified ads present on the LEBONCOIN Service and on the Site and the Applications;

reproduce on any other medium, for commercial purposes or not, all or part of the information or classified ads present on the LEBONCOIN Service and on the Website and the Applications in a way that allows all or part of the original files to be reconstructed;

use a robot, in particular a crawler (spider), a website search or retrieval application, or any other means of retrieving or indexing all or part of the content of the Website and the Applications, except with the express prior authorization of LBC France;

copy the information onto media of any kind that allow all or part of the original files to be reconstructed.

Any reproduction, representation, publication, transmission, use, modification or extraction of all or part of the Elements, in any manner whatsoever, made without the prior written authorization of LBC France is unlawful. Such unlawful acts engage the liability of their authors and may result in legal proceedings against them, in particular for infringement.

7.3. The Leboncoin and Leboncoin.fr trademarks and logos, as well as the trademarks and logos of LBC France's partners, are registered trademarks. Any total or partial reproduction of these trademarks and/or logos without the prior written authorization of LBC France is prohibited.

7.4. LBC France is the producer of the databases of the LEBONCOIN Service. Consequently, any extraction and/or reuse of the database(s) within the meaning of Articles L 342-1 and L 342-2 of the French Intellectual Property Code is prohibited.

7.5. LBC France reserves the right to pursue all legal remedies against persons who have not complied with the prohibitions contained in this article.

7.6. Hyperlinks

7.6.1. Links from the LEBONCOIN Service and/or the Secure Payment Service

The LEBONCOIN Service and/or the Secure Payment Service may contain hyperlinks redirecting to sites operated by third parties. These links are provided for information only.

LBC France exercises no control over these sites and declines all responsibility for access to, the content of, or the use of these sites, as well as for any damage that may result from consulting the information on these sites.

The decision to activate these links is the full and sole responsibility of the User.

7.6.2. Links to the LEBONCOIN Service

No hyperlink may be created to the LEBONCOIN Service without the prior express agreement of LBC France.

If an internet user or a legal entity wishes to create, from its site, a hyperlink to the LEBONCOIN Service, whatever the medium, it must first contact LBC France by sending an email to the following address: support@leboncoin.fr.

Any silence from LBC France shall be interpreted as a refusal.

jspingau commented 5 years ago

Hi, I see my server got blacklisted again only a few hours after applying the headers trick... can others having the problem confirm this?

[EDIT] I could find somewhat of a workaround, but it is not exactly simple. Still, it worked so I'll share it.

  1. Create an SSH tunnel into the blacklisted box and proxy through this tunnel (https://www.systutorials.com/944/proxy-using-ssh-tunnel/)
  2. Go to leboncoin.fr and complete the captcha
  3. Look for your datadome cookie and add it to your script's request headers, around line 242 in search.js

The cookie is supposed to be set for a year... I'll keep monitoring that.

nbusseneau commented 5 years ago

@apetitjean Thanks for the excerpt.

Fun tidbit, which means basically everybody linking to LBC from anywhere is already breaking the TOS:

No hyperlink may be created to the LEBONCOIN Service without the prior express agreement of LBC France.

alexismoreau commented 5 years ago

Same as @jspingau: I switched production servers to get a new IP and got it working for 100 requests, but now I am blocked again on the item getDetails API.

nbusseneau commented 5 years ago

I've played a bit with the LBC API, fiddling with the headers and the timing of request bursts to see how I could fool DataDome.

What I've found so far:

I suspect this page is triggered by having an API response matching the links above in this thread: https://c.datado.me/captcha/?initialCid={bla bla}. As you can see from the screenshot above, they have a form where you can provide an email to request authorization for bots. It sends you a mail with the following message:

DataDome is missioned by Leboncoin to protects its websites and API against unwanted bot traffic for security and performance concerns.

To do this, we identify bots by analyzing hundreds of attributes and matching them to our knowledge base and/or statistical patterns.

Unidentified bots are automatically categorised by DataDome's algorithms as "bad bots". It seems to be your case since you have been blocked on Leboncoin.

If you wish to regain access, the first step is to authenticate your robot by filling-in the form. After validation, it will be moved from the "bad bot" to the "commercial bot" family and this will give you the opportunity to be whitelisted by the website owner.

Along with a link to this form: https://docs.google.com/forms/d/e/1FAIpQLSenPmi8P-6jAZDq_DeFuXOHCj91J-62d56NhX_YC3JLePBcnw/viewform

I have filled in the form with data identifying my personal bot (with a custom user-agent and my own email address as the contact address). I will keep you up to date on what they answer.

Disclaimer: I do not think anyone here intends to break LBC rules purposefully and in bad faith (or at least, I don't. I made a bot only to ease the recovery of my stolen bike should it be posted on LBC...). If they answer negatively to the form request, I suspect this means this library will disappear.

In the meantime, if you wish not to get blocked by DataDome, here's how to fool it (at least temporarily):

tdurieux commented 5 years ago

@Skymirrh Thanks for your feedback!

tdurieux commented 5 years ago

I just pushed a refactoring of the requests. Now all requests use the same headers and share the DataDome cookies. It should be a little bit more reliable.
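
For readers who have not looked at the commit, the general idea of routing every request through one set of headers can be sketched as follows. This is not the library's actual code: `DEFAULT_HEADERS`, `buildRequestOptions` and the header values are placeholders chosen for illustration, and cookie handling is left out. Only the method, hostname and path come from the requests shown earlier in this thread.

```javascript
// Hypothetical sketch, not the library's actual implementation.
const DEFAULT_HEADERS = Object.freeze({
  'User-Agent': 'my-personal-bot/1.0', // placeholder value
  'Accept-Language': 'fr-FR,fr;q=0.9', // placeholder value
  'Accept-Encoding': 'identity',
});

// Builds options for http(s).request; per-request headers override defaults.
function buildRequestOptions(path, extraHeaders = {}) {
  return {
    method: 'POST',
    hostname: 'api.leboncoin.fr',
    path,
    headers: { ...DEFAULT_HEADERS, ...extraHeaders },
  };
}

const options = buildRequestOptions('/finder/search', {
  'content-type': 'text/plain;charset=UTF-8',
});
console.log(options.headers['Accept-Encoding']); // identity
```

With a single builder like this, search.js and item.js cannot drift apart the way they had when item.js was missing the three headers.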

tdurieux commented 5 years ago

Hello all,

Do my last changes "fix" the problem?

Thomas

jspingau commented 5 years ago

Hello Thomas,

Thanks for the patch. Jury's still out in my case: It mostly works, but it looks like I'm being blacklisted again for a few hours and then it works again.

Pretty odd situation overall so I'm still monitoring what's going on: It was working on Thursday, blacklisted again on Thursday night through Friday. Saturday and today it worked again...

I'll keep you posted at the end of the week with more solid info.

cheers


nbusseneau commented 5 years ago

Do my last changes "fix" the problem?

No change w.r.t. the observations in my last comment: proper headers let you fool DataDome for a while, but only if you behave "like a human".

Sending too many requests too fast too regularly and with an identical pattern still gets you blocked. You have to make sure the API calls look like a user browsing the website. On my end, the strategies proposed above are working fine:

I have not been blocked since implementing these; I suspect this is enough to fool DataDome's bot detection in my case (non-intensive use).

PS: I would also recommend that everyone customize the User-Agent and Accept-Language headers, as I suspect DataDome might eventually associate that specific pair of values with bots using leboncoin-api, which would increase the scrutiny applied to your own requests.

PPS: Still no answer from DataDome/LeBonCoin following request for authorization.

ronycohen commented 5 years ago

thanks @Skymirrh

I tried every trick you listed, but it only works erratically. I'm not even able to access the https://c.datado.me/captcha/?***** url to try to answer the captcha....

I make 3 calls every 4 to 5 minutes.... and after each 403 Forbidden I take the datadome cookie from the response and use it on the next try...

I guess my IP is blacklisted, even with throttling, randomized calls, etc.

nbusseneau commented 5 years ago

Is your bot running 24/7? Because for example I don't think making 3 calls every 4 to 5 minutes BUT 24/7 qualifies as "looking like a human browsing LBC" :D

ronycohen commented 5 years ago

"Is your bot running 24/7?"

No, not from 10 PM to 6 AM.

I tried to register my bot with DataDome (for personal, not professional, use). I don't know whether that helps.

Currently it's still not working...

jspingau commented 5 years ago

As promised, a short update on this issue after 10 days of usage. It was a bit hit or miss for a few days after my last message, but since then my server has not been blacklisted. Overall, still a pretty weird situation. My guess is that the DataDome configuration was altered for a few days and then reverted...

Anyone experiencing being blacklisted again recently?

alexismoreau commented 5 years ago

@Skymirrh have you got any news from Datadome / LBC ?

nbusseneau commented 5 years ago

@alexismoreau Nope. No answer whatsoever.

a-farsi commented 4 years ago

Hi everybody, I get a 403 HTTP error when I try to access the API. Is there an authentication step that I missed? Thanks in advance

tdurieux commented 4 years ago

Not really, it is leboncoin blocking the requests. You need to create an infrastructure that leboncoin will not be able to detect.

a-farsi commented 4 years ago

Hi tdurieux, thanks for your answer. So the APIs aren't public? Do you have more details about any kind of infrastructure that leboncoin doesn't block? Thanks in advance