Week 17: Implementation of Document Search

@cctoombs @twood02

I have been building the skeleton for the keyword search engine that runs over the documents in the 40Percent file.

This week I can finish the automatic part of the search engine, where the library stems the document and uses a list of generic stop words to try to identify the key components of the document. The weakness is the same one Lingua::EN::NamedEntity has: it isn't document-specific and it isn't word-weighted. My plan is to use GATE::ANNIE::Simple, once I rebuild the Java bridge in Perl, to identify the proper nouns (names, locations, times) and place them in their own classes, so that I can print a specific class to the user and then use a simple grep or regex to help clear that data from the file. The only challenge for direct implementation is that I need to remove superfluous information with regex beforehand to get the best results. I've tried different regexes and haven't been able to remove Sections, qq{}, etc. without sometimes affecting the rest of the document, mainly because of the section or chapter portion of the regex. I think the solution might be to just drop any line that is 20 characters or shorter.
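To make the "drop short lines, then filter stop words" idea concrete, here is a minimal sketch. It is written in Python rather than Perl only to keep it short and testable; the same logic maps directly onto Perl's `grep` and `split`. The function names, the tiny stop-word list, and the sample text are all hypothetical, and stemming is left out entirely.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "shall", "be"}

def drop_short_lines(text, max_len=20):
    """Remove every line whose stripped length is max_len characters or fewer.

    This also removes blank lines, so paragraph breaks are lost.
    """
    kept = [line for line in text.splitlines() if len(line.strip()) > max_len]
    return "\n".join(kept)

def keyword_counts(text):
    """Count lowercase words that are not in the stop-word list (no stemming)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

sample = (
    "Section 4\n"
    "The tenant shall pay rent to the landlord every month.\n"
    "Chapter 2\n"
    "The landlord shall repair the roof of the house."
)

cleaned = drop_short_lines(sample)
counts = keyword_counts(cleaned)
```

The length cutoff removes headings like "Section 4" without needing a heading-specific regex, but it would also drop any short line of real content, so the threshold of 20 is something to tune against the actual documents.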

I also implemented a vector space search algorithm so that a document can be used to build its own query and then be searched with it, while still accounting for the manual inputs that a user can contribute. The algorithm was easy to set up because the library provided a few subroutines to implement, but I think those subroutines are highly incomplete and I need to modify them so that they don't repeat so much of the same information to the user. I'm going to write a subroutine that takes the output of GATE::ANNIE::Simple and automatically places it in the query.
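As a rough illustration of that plan, the sketch below builds a de-duplicated query from a list of entity strings plus manual user terms, then ranks passages by cosine similarity over raw term counts. It is in Python for testability, not Perl, and it is not the library's actual interface: `build_query`, `search`, and the sample passages are hypothetical, the entity list is just a stand-in for whatever the entity recognizer would return, and there is no TF-IDF weighting.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_query(entities, manual_terms=()):
    """Merge extracted entities and manual terms, dropping repeated tokens."""
    seen = []
    for term in list(entities) + list(manual_terms):
        for tok in tokenize(term):
            if tok not in seen:
                seen.append(tok)
    return " ".join(seen)

def search(passages, query):
    """Return (score, passage) pairs with a nonzero score, best match first."""
    q = Counter(tokenize(query))
    scored = [(cosine(Counter(tokenize(p)), q), p) for p in passages]
    return sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)

passages = [
    "Alice met Bob in Paris.",
    "The weather was cold.",
    "Bob returned to Paris in March.",
]
# Stand-in for recognizer output; note the repeated "Bob".
query = build_query(["Bob", "Paris", "Bob"], manual_terms=["march"])
results = search(passages, query)
```

De-duplicating the query tokens and filtering out zero-score passages are two simple ways to cut down on repeated or irrelevant output shown to the user.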

I think the biggest challenge for me since I started is not just finding the right tools but finding room to inject my own code and ideas, as most of the libraries are highly functional. Especially since I'm going to implement Algorithm::NaiveBayes, I feel like most of the algorithm comes down to how well I manipulate and use regexes. I was hoping one of you might have some ideas about this.
