machawk1 / cdxjGenerator

A script to generate CDXJ TimeMaps for testing elsewhere
MIT License

Provide a flag to make the results realistic #16

Open machawk1 opened 4 years ago

machawk1 commented 4 years ago

In PR #14, @ibnesayeed suggested that I limit the generated dates to the range from the epoch to the current date/time, per the default behavior of the Faker module.

That constraint would be useful, but it is also useful to generate results that conform to outside specs like WARC/1.1 WARC-Date and ISO 8601, which allow dates beyond this range.

There should still be a way to enforce that realistic results are generated. Let's put this behind a command-line flag, e.g., --realistic (see the sketch below). The flag might also be used to enforce realistic distributions of MIME types and status codes per #13.
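A minimal sketch of how such a flag could gate date generation, assuming Python's argparse and the Python Faker package (cdxjGenerator's actual language and option handling may differ):

```python
import argparse

from faker import Faker

parser = argparse.ArgumentParser(description="Generate CDXJ TimeMaps")
parser.add_argument(
    "--realistic",
    action="store_true",
    help="constrain generated values to realistic ranges "
         "(e.g., datetimes between the epoch and now)",
)
args = parser.parse_args()

fake = Faker()

if args.realistic:
    # Faker's default date_time() behavior: a datetime between
    # the Unix epoch (1970-01-01) and now.
    dt = fake.date_time()
else:
    # WARC/1.1 WARC-Date (ISO 8601) permits dates outside that range;
    # the +/-100y window here is an arbitrary assumption.
    dt = fake.date_time_between(start_date="-100y", end_date="+100y")

print(dt.isoformat())
```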

ibnesayeed commented 4 years ago

We should also take into account the fact that, in the realistic case, URIs repeat a lot.

machawk1 commented 4 years ago

@ibnesayeed Do you have any information on long-tail distributions from your profiling research on which we could base this (e.g., numbers)?

ibnesayeed commented 4 years ago

> @ibnesayeed Do you have any information on long-tail distributions from your profiling research on which we could base this (e.g., numbers)?

We do, but perhaps that would be a stretch. We can simply have a weighted coin toss to decide whether to reuse a previous URI. By adjusting the threshold we can produce fewer or more repetitions.
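A sketch of that weighted toss (the names here are hypothetical, and REPEAT_THRESHOLD is the tunable value debated below):

```python
import random

REPEAT_THRESHOLD = 0.3  # probability of reusing a previously generated URI

seen_uris = []

def next_uri(generate_new_uri):
    """Return a previously seen URI with probability REPEAT_THRESHOLD;
    otherwise generate (and remember) a new one."""
    if seen_uris and random.random() < REPEAT_THRESHOLD:
        return random.choice(seen_uris)
    uri = generate_new_uri()
    seen_uris.append(uri)
    return uri
```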

machawk1 commented 4 years ago

What sort of weighting do you recommend? 0.9? Should the weight decay with each new URI-R, or change each time the respective URI-R is repeated?

ibnesayeed commented 4 years ago

I would keep that number much lower because a big portion of URIs is never repeated, and we want to preserve that behavior. A static value will be good enough initially, but if we want it dynamic, then the initial threshold should be small and gradually increase up to a limit as a URI is repeated.

machawk1 commented 4 years ago

> I would keep that number much lower because a big portion of URIs is never repeated

This is the sort of information I am fishing for here. Should it be 0.25? 0.5? 0.62? What is reasonable from your experience? I now know that 0.9 is not.

ibnesayeed commented 4 years ago

For a static value, I would keep it at something like 30-40%. However, if we were to make it dynamic, I would start from 10% and gradually increase it to something like 70-80%. The idea here is that a URL is unlikely to be archived again at first, but once it has been archived twice it has a higher chance of being archived again, and the chance grows with each duplicate. However, we do not want the chance to be so high (say, 95%) that the generator gets stuck there and cannot easily jump out to a new URL.
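One way to sketch that dynamic scheme (the 10% floor and 75% cap come from the percentages above; the helper name and the growth step are assumptions):

```python
import random

INITIAL_THRESHOLD = 0.10  # a fresh URI is unlikely to repeat
MAX_THRESHOLD = 0.75      # cap so the generator can still jump to new URIs
GROWTH_STEP = 0.15        # assumed increase per observed repetition

repeat_probability = {}  # URI -> current probability of repeating again

def should_repeat(uri):
    """Weighted toss whose odds grow each time the URI repeats."""
    p = repeat_probability.get(uri, INITIAL_THRESHOLD)
    if random.random() < p:
        # The chance of yet another capture grows with each duplicate,
        # but never exceeds MAX_THRESHOLD.
        repeat_probability[uri] = min(p + GROWTH_STEP, MAX_THRESHOLD)
        return True
    return False
```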

That said, I am not sure how likely it is for Faker to generate duplicates when a lot of URLs are generated randomly. If it generates duplicates often, then we do not need to work too hard, but if it does not, we will need to put some logic in place.
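That likelihood is easy to measure empirically, assuming the Python Faker package (the sample size here is arbitrary):

```python
from collections import Counter

from faker import Faker

fake = Faker()

n = 100_000
counts = Counter(fake.uri() for _ in range(n))
duplicates = n - len(counts)
print(f"{duplicates} of {n} generated URIs were duplicates")
```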