sailuh / perceive

PERCEIVE is a project incubator inspired by Apache Incubator and Stack Exchange's Area 51. It serves as a staging zone repository for the project's early ideas.
http://sailuh.github.io/perceive
GNU General Public License v2.0

Enhancement to Full Disclosure Crawler and Parsers #92

Open jgwl opened 6 years ago

jgwl commented 6 years ago

Taken from https://github.com/sailuh/perceive/pull/74

1. seclists_crawler_raw.py

1.1 Still doesn't provide an optional flag for the save path.

Output parameter -o

For both the Crawler and the Parser, rather than defaulting to saving in the folder the script is run from, an optional parameter -o would be useful. For those of us who will be versioning the code, this would avoid having to move the files manually every time we download a new month, and it would make the scripts easier to drive from the command line.

Note also that the expected behavior (although intuitively I see where you are going) is inconsistent between the 2 scripts, which may leave a student confused: the Crawler script downloads into the folder it is run from, while the Parser script writes to the provided input path instead of where the script is run.
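The shared -o flag could be sketched with argparse roughly as below; the flag name, help text, and default are a suggestion, not the scripts' current interface:

```python
import argparse
import os

def build_arg_parser():
    """Hypothetical argument parser shared by Crawler and Parser."""
    parser = argparse.ArgumentParser(
        description="SecLists crawler/parser (illustrative sketch)")
    parser.add_argument(
        "-o", "--output-dir", default=os.getcwd(),
        help="directory to save output files "
             "(default: current working directory)")
    return parser

if __name__ == "__main__":
    args = build_arg_parser().parse_args()
    print(args.output_dir)
```

With a shared default of the current working directory, both scripts would behave consistently when -o is omitted.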

1.2 README.md

Should mention what the user should expect to be downloaded. Currently, it is each individual e-mail HTML page plus an index.html page whose name follows the format _.raw.html, the main difference being the absence of a relative id in the file name.

2. seclists_index_parse.py

2.1 Script help message example is incorrect (?)

-f , parse single raw file, e.g. -f ./2011_Jan_0.raw.html

From your README.md (very nicely done, by the way), I assume this would be without the 0 in it, i.e. 2011_Jan.raw.html?
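The naming convention described above (e-mail pages carry a relative id, the index page does not) could be checked with a pattern like the following; the exact regexes are my reading of the examples, not taken from the scripts:

```python
import re

# E-mail pages look like 2011_Jan_0.raw.html (with a relative id);
# the index page looks like 2011_Jan.raw.html (no id).
EMAIL_RE = re.compile(r"^\d{4}_[A-Z][a-z]{2}_\d+\.raw\.html$")
INDEX_RE = re.compile(r"^\d{4}_[A-Z][a-z]{2}\.raw\.html$")

def is_email_page(name):
    """True if the file name matches the per-e-mail naming scheme."""
    return bool(EMAIL_RE.match(name))

def is_index_page(name):
    """True if the file name matches the monthly index naming scheme."""
    return bool(INDEX_RE.match(name))
```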

2.2 Lacks save path

Currently writes output to the input path directory. The README.md should also mention that "possible follow-ups" are added by the parser the same way as follow-ups, without any "possible" marker.

3. Add some python tests to ensure consistency across the scripts

Given it is hard to tell from the result files whether any are missing, now or in the future, it would be interesting to have tests that:

This should suffice to minimally check that all scripts are working consistently. Additional tests could include, for example, checking that the number of authors is correct and that the number of e-mail parents matches the expected count.

General Notes

5. Missing requirements.txt with python libraries.
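A minimal requirements.txt would pin the libraries the scripts import; the packages below are only a guess based on what crawler/parser scripts typically use, and the actual list would need to match the scripts' imports:

```
# requirements.txt (hypothetical contents)
requests
beautifulsoup4
```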

6. Parent README.md

Should probably add a parent folder containing both the Crawler and the Parser, with a README.md mentioning the existence of the 3 scripts, a one-line statement of what each does, and the agreed taxonomy of the file names.

carlosparadis commented 6 years ago

@jgwl thank you for putting it together on a new issue and wanting to wrap up on this! :-)