A website crawler with an attached search engine over the crawled results.
Handle this: http://dblp.uni-trier.de/db/journals/phat/ and all other journals; check for 404s in the raw files manually.
Testing is done with Nose: https://www.google.de/search?q=python+nosetests&gws_rd=ssl
Instead of having different indexes, we can just multiply the number of occurrences of a word by n (say 4) if it appears in a title, to boost the title's weight.
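A minimal sketch of that single-index boost, assuming the parser already yields separate title and body term lists (the function name and the boost value of 4 are illustrative):

```python
from collections import Counter

TITLE_BOOST = 4  # the "n" above; purely illustrative

def count_terms(title_terms, body_terms, boost=TITLE_BOOST):
    """Count occurrences per term, crediting each title occurrence
    `boost` times instead of keeping a separate title index."""
    counts = Counter(body_terms)
    for term in title_terms:
        counts[term] += boost
    return counts

# "watch" appears once in the title and twice in the body -> 2 + 4 = 6
print(count_terms(["watch"], ["watch", "watch", "movie"]))
```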
In the parser, reduce words like "writing" to "writ" so that a user searching for "write" will find them (the query is reduced to "writ" as well). It is also possible to tweak the handling of "-": "Hola WHATS-up" -> "hola whats up whatsup whats-up".
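A sketch of both tweaks, using a crude suffix stripper rather than a real stemmer; the suffix list and the minimum stem length are arbitrary assumptions:

```python
SUFFIXES = ("ing", "ed", "es", "e", "s")  # crude, illustrative list

def crude_stem(word):
    """Strip a common suffix so "writing", "writes" and "write" all become "writ"."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def expand_hyphenated(token):
    """Emit the split parts, the joined form and the original hyphenated token."""
    if "-" not in token:
        return [token]
    parts = token.split("-")
    return parts + ["".join(parts), token]

print(crude_stem("writing"), crude_stem("write"))   # writ writ
print(expand_hyphenated("whats-up"))                # ['whats', 'up', 'whatsup', 'whats-up']
```

A real deployment would more likely use an off-the-shelf stemmer (e.g. a Porter stemmer), but the idea is the same: normalize both the documents and the query with the identical function.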
In charge of crawling the websites' HTML files in a file directory on the server.
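One possible reading of this, sketched here, is that the raw HTML pages already live in a directory on the server and the crawler simply walks it; the path handling and encoding choices are assumptions:

```python
import os

def crawl(root_dir):
    """Yield (path, raw_html) for every HTML file found under `root_dir`."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.endswith((".html", ".htm")):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="replace") as fh:
                    yield path, fh.read()
```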
In charge of parsing the raw input text and outputting structured terms, usable by the intermediate.
Builds intermediate lists to be used by the indexer, in the following form (or similar):
term1: { set of docs with the number of occurrences in each doc }, e.g. doc1:3, doc4:6, ...
term2: ...
...
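A minimal sketch of that structure, assuming the parser hands over a mapping of document id to term list (all names are illustrative):

```python
from collections import defaultdict

def build_intermediate(parsed_docs):
    """parsed_docs: {doc_id: [term, term, ...]} as produced by the parser.
    Returns {term: {doc_id: occurrence_count}}."""
    intermediate = defaultdict(dict)
    for doc_id, terms in parsed_docs.items():
        for term in terms:
            intermediate[term][doc_id] = intermediate[term].get(doc_id, 0) + 1
    return dict(intermediate)

# Matches the example above: term1 occurs 3 times in doc1 and 6 times in doc4.
docs = {"doc1": ["term1"] * 3, "doc4": ["term1"] * 6}
print(build_intermediate(docs))   # {'term1': {'doc1': 3, 'doc4': 6}}
```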
In charge of building a term dictionary or index based on document links, term frequencies and document frequencies. Possibly also hero lists / tier-based lists.
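A sketch of the weighting step, assuming the intermediate lists above and a standard log-idf scheme (the exact tf-idf variant is an open choice):

```python
import math

def build_index(intermediate, total_docs):
    """intermediate: {term: {doc_id: tf}} from the previous stage.
    Returns {term: {doc_id: tf-idf weight}}."""
    index = {}
    for term, postings in intermediate.items():
        df = len(postings)                 # document frequency of the term
        idf = math.log(total_docs / df)
        index[term] = {doc: tf * idf for doc, tf in postings.items()}
    return index
```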
Keep additional fields/indexes for each term (e.g. body, title, ISBN, type).
On a query search, search all these indexes separately and combine the results.
Based on the query, run an algorithm which looks up the relevant results in the indexes.
Example: searching for "watch" finds doc1 and doc3 in index 'body' with tf-idf weights 2.3 and 0.7 respectively, and doc3 in index 'title' with tf-idf 1. We assign a (hard-coded) amplifier of 2.0 for the title index, 3.0 for ISBN and 2.5 for Type.
Now we combine the results of all the index searches, summing up the amplified weights, so we get as a final result list:
[doc3:2.7, doc1:2.3]
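A sketch of that combination step; the amplifier values are the hard-coded ones from the example, and a body amplifier of 1.0 is an assumption:

```python
AMPLIFIERS = {"body": 1.0, "title": 2.0, "isbn": 3.0, "type": 2.5}

def search(query_term, indexes, amplifiers=AMPLIFIERS):
    """indexes: {index_name: {term: {doc_id: tf-idf weight}}}.
    Sums the amplified weights per document across all indexes."""
    scores = {}
    for name, index in indexes.items():
        for doc, weight in index.get(query_term, {}).items():
            scores[doc] = scores.get(doc, 0.0) + amplifiers.get(name, 1.0) * weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# The example above: doc3 = 0.7 + 2.0 * 1.0 = 2.7, doc1 = 2.3
indexes = {
    "body":  {"watch": {"doc1": 2.3, "doc3": 0.7}},
    "title": {"watch": {"doc3": 1.0}},
}
print(search("watch", indexes))   # [('doc3', 2.7), ('doc1', 2.3)]
```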