nicolas-slusarenko opened 6 years ago
Although the cruncher is at an early stage of development, the just-committed version delivers interesting results.
To reproduce these results, download the text of this web page, https://en.wikipedia.org/wiki/David_Hilbert, and save it in /tmp/someText.txt
Running the cruncher, I got these first lines:
$ ./bin/cruncher
occ: [14] seq: [of]
occ: [10] seq: [and]
occ: [10] seq: [the]
occ: [6] seq: [of the]
occ: [6] seq: [hilbert]
occ: [4] seq: [theory]
occ: [4] seq: [as]
occ: [4] seq: [a]
occ: [3] seq: [is]
occ: [3] seq: [his]
occ: [3] seq: [one]
occ: [3] seq: [one of]
occ: [3] seq: [one of the]
occ: [3] seq: [mathematical]
occ: [3] seq: [in]
occ: [3] seq: [theory and]
occ: [2] seq: [mathematics]
occ: [2] seq: [20th]
occ: [2] seq: [as one]
occ: [2] seq: [to]
occ: [2] seq: [and developed]
occ: [2] seq: [as one of]
occ: [2] seq: [as one of the]
occ: [2] seq: [set]
occ: [2] seq: [he]
occ: [2] seq: [developed]
occ: [2] seq: [for]
occ: [1] seq: [of the mathematical research]
occ: [1] seq: [of the mathematical research of the 20th century hilbert]
..
It looks as if the relevant information of the web page has been squeezed out. Work is ongoing!
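As a minimal sketch of the underlying idea, counting the occurrences of every short word sequence in a text could look like the following; the file name matches the example above, but everything else (the five-word cap, the names) is an assumption, not the project's actual code:

// Hypothetical sketch: count every word sequence of up to five words
// in a text file and print them in descending order of occurrence.
#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <iterator>
#include <map>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    // Read the whole file, lowercase it, keep only letters and digits.
    std::ifstream file("/tmp/someText.txt");
    std::string raw((std::istreambuf_iterator<char>(file)),
                    std::istreambuf_iterator<char>());
    for (char &c : raw)
        c = std::isalnum(static_cast<unsigned char>(c))
              ? static_cast<char>(std::tolower(static_cast<unsigned char>(c)))
              : ' ';

    // Tokenize into words.
    std::istringstream stream(raw);
    std::vector<std::string> words;
    for (std::string word; stream >> word;)
        words.push_back(word);

    // Count every sequence of one to five consecutive words.
    std::map<std::string, int> occurrences;
    for (std::size_t i = 0; i < words.size(); ++i)
    {
        std::string seq;
        for (std::size_t n = 0; n < 5 && i + n < words.size(); ++n)
        {
            if (n > 0)
                seq += ' ';
            seq += words[i + n];
            ++occurrences[seq];
        }
    }

    // Sort by descending occurrence and print in the cruncher's format.
    std::vector<std::pair<std::string, int>> sorted(occurrences.begin(),
                                                    occurrences.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const auto &a, const auto &b) { return a.second > b.second; });
    for (const auto &item : sorted)
        std::cout << "occ: [" << item.second
                  << "] seq: [" << item.first << "]\n";
}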
For every page crunched, its text sequences are extracted, their relevance is computed, and they are inserted into the database, as shown in this snippet of the cruncher's output.
processing page: 1 index: 361948 url: 'http://en.wikipedia.org/wiki/WikiHow' level: 3'
extracting sequences ...
wordCount: 3260 textLength: 16089
seq: [wikipedia] body: [5] url: [25] title: [47] total: [5875]
seq: [wikihow] body: [48] url: [1] title: [36] total: [1728]
seq: [en wikipedia org w] body: [1] url: [41] title: [35] total: [1435]
seq: [wikipedia en wikihow] body: [1] url: [18] title: [74] total: [1332]
seq: [en wikihow from wikipedia] body: [1] url: [15] title: [61] total: [915]
seq: [wikipedia en wikihow from] body: [1] url: [15] title: [61] total: [915]
seq: [page wikipedia en wikihow] body: [1] url: [15] title: [61] total: [915]
seq: [wikipedia en wikihow from wikipedia] body: [2] url: [10] title: [43] total: [860]
seq: [https: en wikipedia org w] body: [1] url: [31] title: [26] total: [806]
seq: [en wikipedia org w index] body: [1] url: [31] title: [26] total: [806]
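Judging from the numbers in this log, total appears to be simply the product of the three scores; for example, 5 × 25 × 47 = 5875 for wikipedia and 48 × 1 × 36 = 1728 for wikihow. A minimal sketch of that computation, with illustrative names:

// Sketch of the relevance computation as it appears from the log above:
// the total is the product of the body, URL and title scores.
struct SequenceRelevance
{
    int body  = 0;   // occurrences of the sequence in the page body
    int url   = 0;   // score of the sequence against the page URL
    int title = 0;   // score of the sequence against the page title

    int total() const { return body * url * title; }
};

A multiplicative combination like this explains why a very low URL score drags down a sequence that is strong everywhere else, which is exactly what happens to wikihow below.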
This log also points to an interesting fact. I expected the text sequence wikihow to be the most relevant in en.wikipedia.org/wiki/WikiHow, but the most relevant is wikipedia. What happened?
Although the sequence wikihow is the most relevant in the body and very relevant in the title, its URL score is the lowest possible. This happened because the algorithm that looks for the sequence in the URL is case-sensitive: the extracted sequences are lowercase, while the URL contains WikiHow. There is something to improve.
Now the algorithm that computes the relevance of a sequence in the URL or title uses a lowercase copy of the URL or title. The ranking is better than in the previous version, as this log output shows: wikihow is now the most relevant text sequence in the en.wikipedia.org/wiki/WikiHow page.
processing page: 1 index: 361948 url: 'http://en.wikipedia.org/wiki/WikiHow' level: 3'
extracting sequences ...
wordCount: 3260 textLength: 16089
seq: [wikihow] body: [48] url: [19] title: [36] total: [32832]
seq: [wikipedia] body: [5] url: [25] title: [47] total: [5875]
seq: [wikipedia en wikihow] body: [1] url: [50] title: [74] total: [3700]
seq: [wikipedia en wikihow from wikipedia] body: [2] url: [29] title: [43] total: [2494]
seq: [wikipedia en wikihow from] body: [1] url: [40] title: [61] total: [2440]
seq: [page wikipedia en wikihow] body: [1] url: [40] title: [61] total: [2440]
seq: [en wikihow from wikipedia] body: [1] url: [40] title: [61] total: [2440]
seq: [wikihow from wikipedia] body: [1] url: [35] title: [67] total: [2345]
seq: [wikihow com] body: [6] url: [13] title: [25] total: [1950]
seq: [en wikihow from wikipedia the] body: [1] url: [36] title: [53] total: [1908]
seq: [wikihow from wikipedia the] body: [1] url: [30] title: [58] total: [1740]
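The fix amounts to lowering the case of the URL or title before looking for the sequence in it. A minimal sketch of the idea, with illustrative helper names:

#include <algorithm>
#include <cctype>
#include <string>

std::string toLower(const std::string &text)
{
    std::string copy(text);
    std::transform(copy.begin(), copy.end(), copy.begin(),
                   [](unsigned char c)
                   { return static_cast<char>(std::tolower(c)); });
    return copy;
}

// Returns how many times 'sequence' (already lowercase) occurs in 'text',
// ignoring case; a stand-in here for the URL/title relevance score.
int countOccurrences(const std::string &sequence, const std::string &text)
{
    const std::string lowered = toLower(text);
    int count = 0;
    for (std::size_t pos = lowered.find(sequence);
         pos != std::string::npos;
         pos = lowered.find(sequence, pos + 1))
        ++count;
    return count;
}

With this change, countOccurrences("wikihow", "http://en.wikipedia.org/wiki/WikiHow") finds a match instead of none.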
The structure pageInfo, which keeps the information extracted from the page under crunching, has been turned into a class.
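Purely as an illustration of such a transformation, the class form hides the members behind accessors so invariants can be enforced in one place; the members shown here are assumptions, not the actual code:

#include <string>

class PageInfo
{
public:
    void setTitle(const std::string &t) { title = t; }
    const std::string &getTitle() const { return title; }

    void setBody(const std::string &b) { body = b; }
    const std::string &getBody() const { return body; }

private:
    std::string title;   // page title, cleaned before database insertion
    std::string body;    // extracted page text
};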
URLs are rejected if they are too long, contain single quotes, or point to unsupported file types, as sketched below.
Also, a limit is set on how many URLs can be injected into the database from the web page under crunching.
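A sketch of this kind of URL filtering; the length limit and the list of rejected extensions are illustrative values, not the project's actual ones:

#include <string>
#include <vector>

bool acceptUrl(const std::string &url)
{
    const std::size_t maxLength = 300;            // assumed limit
    if (url.size() > maxLength)
        return false;                             // too long

    if (url.find('\'') != std::string::npos)
        return false;                             // contains a single quote

    // Reject URLs pointing to unsupported file types.
    const std::vector<std::string> rejected = {".pdf", ".jpg", ".png", ".zip"};
    for (const std::string &ext : rejected)
        if (url.size() >= ext.size() &&
            url.compare(url.size() - ext.size(), ext.size(), ext) == 0)
            return false;

    return true;
}

Independently of the filter, only the first N accepted URLs of a page would be injected into the database, N being a configuration value.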
Remove excess white space from the title before injecting it into the database. Remove an unnecessary commented-out line.
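The clean-up could be as small as this sketch, which collapses runs of white space into a single space:

#include <regex>
#include <string>

std::string collapseWhiteSpace(const std::string &title)
{
    // Replace every run of white space (spaces, tabs, newlines) with
    // a single space.
    static const std::regex runs("\\s+");
    return std::regex_replace(title, runs, " ");
}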
It is now possible to restrict the pages to crunch based on their domain. Introduced the class Trokam::Exception, which carries information about the error and, if available, about the page under crunching.
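An illustrative sketch of what such an exception class might look like; the member names are assumptions:

#include <optional>
#include <stdexcept>
#include <string>

namespace Trokam
{
    class Exception : public std::runtime_error
    {
    public:
        Exception(const std::string &message,
                  const std::optional<std::string> &pageUrl = std::nullopt)
        : std::runtime_error(message), page(pageUrl)
        {}

        // URL of the page under crunching, if it is known.
        const std::optional<std::string> &pageUrl() const { return page; }

    private:
        std::optional<std::string> page;
    };
}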
A program should search for text sequences and their relevance, instead of the isolated words used in Step1.