mikkokotila opened this issue 6 years ago
No worries, thank you for taking an interest.
On the third point, we've made some assumptions about how you're accessing the data which might not be the same assumptions you've made. I'll add some more docs in the next few days on how the logic flow works and what consistency guarantees are made.
I'll shortly update the README to be a bit more explicit about how to get started, providing a couple of env var exports and a dummy CLI command.
Thanks. Regarding point 2, did I understand correctly that there can be several input files? What would be the syntax for providing more than one file?
I will do some more testing as well (you could try to reproduce it simply by creating a crawl with a list of 100 sites, then adding sites to that 100 and seeing the results).
Finally, I'm not sure I understand what you mean by "pretty much constantly". My site list is about 4 million sites initially, and then new sites come in every day, maybe on the order of 10^4 to 10^5. It's hard to imagine your use case would be much different, i.e. such a scan would easily be processed in a small fraction of a day.
> Thanks. Regarding point 2, did I understand correctly that there can be several input files? What would be the syntax for providing more than one file?
No, just the one. It would be simple for me to switch to globbing (e.g. `/foo/bar/baz/*.csv`) though, which is a good idea, as you could then provide a single file or multiple files with one path.
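For illustration, a minimal sketch of how glob-based input could work, assuming a Python implementation; `load_domains` is a hypothetical helper, not the project's actual API:

```python
import csv
import glob

def load_domains(path_pattern):
    """Collect domains from every CSV matching the pattern.

    A plain path (/foo/bar/baz/sites.csv) and a glob
    (/foo/bar/baz/*.csv) both work, so the same argument
    covers the single-file and multi-file cases.
    """
    domains = []
    for csv_path in sorted(glob.glob(path_pattern)):
        with open(csv_path, newline="") as handle:
            for row in csv.reader(handle):
                if row:  # skip blank lines
                    domains.append(row[0].strip())
    return domains

# One path expression, one or many files:
# load_domains("/foo/bar/baz/sites.csv")
# load_domains("/foo/bar/baz/*.csv")
```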
> Finally, I'm not sure I understand what you mean by "pretty much constantly". My site list is about 4 million sites initially, and then new sites come in every day, maybe on the order of 10^4 to 10^5. It's hard to imagine your use case would be much different, i.e. such a scan would easily be processed in a small fraction of a day.
We scan between 10^5 and 10^6 daily over a dynamic list of sites rather than one hard-coded list (we write them to a mounted share based upon what we've seen over the past few hours). Because the list changes frequently, we just leave the process looping. I'm not sure how long it'll take to scan ~4 million domains, but I'll grab a domain list and do some testing.
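To make "leave the process looping" concrete, here's a rough sketch of that kind of long-running pass, reusing the hypothetical `load_domains` from the earlier sketch; the paths, interval, and `scan_domains` stub are my assumptions, not the project's actual behaviour:

```python
import time

SITE_LIST_GLOB = "/mnt/share/domains/*.csv"  # hypothetical mounted-share path
RESCAN_INTERVAL = 300                        # seconds between passes

def scan_domains(domains):
    """Placeholder for the real ads.txt crawl and DB write."""
    for domain in domains:
        print(f"scanning {domain}")

seen = set()
while True:
    # Re-read the frequently changing site list on every pass.
    current = set(load_domains(SITE_LIST_GLOB))

    # Only scan domains that have not been handled yet.
    new_domains = current - seen
    if new_domains:
        scan_domains(new_domains)
        seen |= new_domains

    time.sleep(RESCAN_INTERVAL)
```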
Thanks for making this available.
Can you explain the usage logic of the system a little?
1) Is it correct that the system first fetches the data and, once the list of sites has been handled, updates the database in full, or is the DB updated on the fly?
2) Once the list of sites is handled, the program ends up in a loop that runs roughly once every 300 ms, where the output is:
What's the point of this? If the idea is that when the text file is updated with new domains those will be scanned, that's not happening.
3) There is also the issue that many runs end up with a lot of NULL values (instead of 0 or 1) in the `adstext_present` variable.
Do you think you could test once on your end with SQLite and once with MySQL, and then, based on that, provide a complete, working code example in the README for both? Something where the user literally needs to do nothing but replace the parameters with their own information.
I think that would be really useful for those interested in adopting this.
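For the sake of discussion, such a README entry could look roughly like the sketch below. It is only an illustration under assumptions: the `ADSTXT_DB_URL` environment variable, the SQLAlchemy dependency, and the connection URLs are guesses, not the project's actual interface.

```python
import os

from sqlalchemy import create_engine, text

# Pick a backend by exporting ADSTXT_DB_URL before running, e.g.
#   SQLite: export ADSTXT_DB_URL="sqlite:////var/lib/adstxt/adstxt.db"
#   MySQL:  export ADSTXT_DB_URL="mysql+pymysql://user:password@db.example.com/adstxt"
db_url = os.environ.get("ADSTXT_DB_URL", "sqlite:///adstxt.db")

engine = create_engine(db_url)

# Smoke test: confirm the connection works before starting a crawl.
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
print(f"connected to {db_url}")
```

A user would then only need to swap in their own connection details; everything else would stay the same for both backends.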