infectious / adstxtcrawler

Crawls publisher sites and grabs information from their ads.txt files
MIT License

use logic #1

Open mikkokotila opened 6 years ago

mikkokotila commented 6 years ago

Thanks for making this available.

Can you explain the usage logic of the system a little?

1) Is it correct that the system first fetches the data and, once the list of sites has been handled, updates the database fully in one go, or is the DB updated on the fly?

2) Once the list of sites has been handled, the program ends up in a loop that runs roughly every 300 ms, where the output is:

2018-09-24 19:12:10,191 - adstxt.main - INFO - Searching for domains to crawl...
2018-09-24 19:12:10,348 - adstxt.main - INFO - '77fe644c572ff1ba8a08-aa3fcb8dba820dc6b4fabb3e45b3ad4d.ssl.cf1.rackcdn.com' found to be an invalid domain.
2018-09-24 19:12:10,467 - adstxt.main - INFO - Done processing current available domains.

What's the point of this? If the idea is that when the input file is updated with new domains those will be scanned, that's not happening.

3) There is also the issue that many runs end up with a lot of NULL values (instead of 0 or 1) in the adstxt_present column.

Do you think you could test once on your end with SQLite and once with MySQL, and then, based on that, provide a complete, working example for both in the README? Something where the user literally needs to do nothing but replace the parameters with their own information.

I think that would be really useful for those interested in adopting this.

Poogles commented 6 years ago

No worries, and thank you for taking an interest.

  1. The DB is updated in a background thread from the crawler. The crawler logic normally completes before the database logic, so there will be a period where results are still being written to the database and nothing is being crawled.
  2. We've normally got a long enough list of domains that crawling runs pretty much constantly. Later this week I'll add a commit that does some back-off so we're not looping on that so much. Files will be re-read every iteration here and here; I'll add a test to ensure that happens, but I can't see anything broken.
  3. Nothing should be NULL at the end of a run. Domains are parsed from the input (either a file or Elasticsearch) and are written to the domains table with a NULL (unset) adstxt_present and last_updated set to 0. Once all domains have been crawled within a run, the table should be updated and no NULL values should remain (see the sketch after this list). Again, I'll add a test for this later this week.
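
To make points 1 and 3 concrete, here is a minimal sketch, not the project's actual code, of that flow: domains start with a NULL adstxt_present, the crawler hands results to a background writer thread, and by the end of a run every row should hold 0 or 1. The table and column names are assumed from this thread; the real schema and threading details may differ.

```python
import queue
import sqlite3
import threading
import time

# Assumed schema based on the description above; the crawler's real schema may differ.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute(
    """CREATE TABLE IF NOT EXISTS domains (
           domain TEXT PRIMARY KEY,
           adstxt_present INTEGER,          -- NULL until crawled, then 0 or 1
           last_updated INTEGER NOT NULL DEFAULT 0
       )"""
)

results = queue.Queue()

def db_writer() -> None:
    """Background thread: drain crawl results into the database."""
    while True:
        item = results.get()
        if item is None:                     # sentinel: crawler is finished
            break
        domain, has_adstxt = item
        conn.execute(
            "UPDATE domains SET adstxt_present = ?, last_updated = ? WHERE domain = ?",
            (1 if has_adstxt else 0, int(time.time()), domain),
        )
    conn.commit()

# Step 1: domains parsed from the input are inserted with adstxt_present unset
# (NULL) and last_updated = 0.
for domain in ("example.com", "example.org"):
    conn.execute("INSERT OR IGNORE INTO domains VALUES (?, NULL, 0)", (domain,))

writer = threading.Thread(target=db_writer)
writer.start()

# Step 2: the crawler pushes results onto the queue; it typically finishes
# before the writer has flushed everything, which is the window where nothing
# is being crawled but rows are still being written.
results.put(("example.com", True))
results.put(("example.org", False))
results.put(None)
writer.join()

# Step 3: by the end of a run, no adstxt_present value should still be NULL.
leftover = conn.execute(
    "SELECT COUNT(*) FROM domains WHERE adstxt_present IS NULL"
).fetchone()[0]
print(f"rows still NULL after run: {leftover}")
```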

On the third point, we've made some assumptions about how you're accessing the data which might not be the same assumptions you've made. In the next few days I'll add some more docs around how the logic flow works and what consistency guarantees are made.

I'll update the README shortly to be a bit more explicit about how to get started, providing a couple of env var exports and a dummy CLI command.

mikkokotila commented 6 years ago

Thanks. Regarding point 2, did I understand correctly that there can be several files for input? What would be the syntax for inputting more than one file?

I will do some more testing as well (you could try to reproduce this simply by creating a crawl with a list of 100 sites, then adding sites to that 100 and seeing the results).

Finally, I'm not sure I understand what you mean by "pretty much constantly". My site list is about 4 million sites initially, and new sites come in every day, maybe on the order of 10^4 to 10^5. It's hard to imagine your use case would be much different, i.e. such a scan would easily be processed in a small fraction of a day.

Poogles commented 6 years ago

> Thanks. Regarding point 2, did I understand correctly that there can be several files for input? What would be the syntax for inputting more than one file?

No, just the one. It would be simple for me to switch to globbing (e.g. /foo/bar/baz/*.csv) though, which is a good idea, as you could provide either a single file or multiple files with one path.
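
For reference, a rough sketch of what that globbing could look like on the input side; the helper name, CSV layout, and paths are illustrative assumptions, not the crawler's actual interface:

```python
import csv
import glob
from typing import Iterator

def iter_domains(path_or_pattern: str) -> Iterator[str]:
    """Yield domains from one file, or from every file a glob pattern matches.

    A concrete path (e.g. /foo/bar/domains.csv) simply matches itself, so the
    same argument handles a single file or many (e.g. /foo/bar/baz/*.csv).
    """
    for path in sorted(glob.glob(path_or_pattern)):
        with open(path, newline="") as fh:
            for row in csv.reader(fh):
                if row:                      # skip blank lines
                    yield row[0].strip()     # assume the domain is the first column

# Both of these would use the same interface:
# iter_domains("/foo/bar/domains.csv")
# iter_domains("/foo/bar/baz/*.csv")
```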

> Finally, I'm not sure I understand what you mean by "pretty much constantly". My site list is about 4 million sites initially, and new sites come in every day, maybe on the order of 10^4 to 10^5. It's hard to imagine your use case would be much different, i.e. such a scan would easily be processed in a small fraction of a day.

We scan between 10^5 and 10^6 domains daily over a dynamic list of sites rather than one hard-coded list (we write them to a mounted share based on what we've seen over the past few hours). Because the list changes frequently, we just leave the process looping. I'm not sure how long it'll take to scan ~4 million domains, but I'll grab a domains list and do some testing.