henrypoydar / chronic_duration

A simple Ruby natural language parser for elapsed time
MIT License

Date Ranges #31

Open natesire opened 10 years ago

natesire commented 10 years ago

I am writing my own solution to calculate date ranges (e.g. May 22 2014 to June 3 2015) based on chronic. I would gladly contribute this solution if needed.

StephenOTT commented 10 years ago

This already exists https://github.com/tmlee/time_difference

natesire commented 10 years ago

Thanks. I emailed the author of time_difference. I am actually looking for something that uses machine learning in a natural language approach: I need to parse human-written date ranges. I might fork your chronic and post the beginnings of it. I am still deciding which language to implement the machine learning in. Python has a great NLP toolkit (NLTK), and C++ for Ruby extensions might take a while. But I even like Scala. Any ideas?

StephenOTT commented 10 years ago

From a Ruby perspective, do you have an aversion to wrapping chronic with time_difference?

Something like this:

require 'chronic'
require 'time_difference'

humanStatement1 = "this tuesday 1pm"
humanStatement2 = "this tuesday 3pm"

humanStatement1Parsed = Chronic.parse(humanStatement1)
humanStatement2Parsed = Chronic.parse(humanStatement2)

# very human readable version
puts TimeDifference.between(humanStatement1Parsed, humanStatement2Parsed).in_hours  #=> 2.0

# Version without the intermediate parsed variables
puts TimeDifference.between(Chronic.parse(humanStatement1), Chronic.parse(humanStatement2)).in_hours  #=> 2.0

# Single Line version
puts TimeDifference.between(Chronic.parse("this tuesday 1pm"), Chronic.parse("this tuesday 3pm")).in_hours  #=> 2.0

StephenOTT commented 10 years ago

Use your NLP step to split each statement into a start-date token and an end-date token (humanStatement1 and humanStatement2 above).

StephenOTT commented 10 years ago

For NLP have you looked at OpenNLP? http://opennlp.apache.org

and then for the ruby bindings, use: https://github.com/louismullie/open-nlp

natesire commented 10 years ago

I am testing time_difference. I didn't even know about OpenNLP. Awesome. I am checking all of this out.

natesire commented 10 years ago

I have to handle all kinds of weird characters like - / -- & etc... that can be inside and outside parts of the dates. I am going to write the more advanced parsing in Scala.

StephenOTT commented 10 years ago

This is what the NLP step is for: tokenize the text first, then remove or replace the characters and words you don't need.
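As a rough illustration of that idea, here is a minimal Ruby sketch that splits a human-written range into a start phrase and an end phrase before either is handed to a date parser. The helper name `split_range` and the separator list are assumptions made for this example; they are not part of chronic or time_difference. A bare `/` is deliberately left out of the separators because it also appears inside dates such as 5/22/2014.

```ruby
# Separators that may sit between the two dates (an assumed, incomplete list).
# Each one must be surrounded by whitespace to count as a separator.
RANGE_SEPARATOR = /\s+(?:-{1,2}|&|to|through|until)\s+/i

# Hypothetical helper: returns [start_text, end_text],
# or nil when no separator is found.
def split_range(text)
  parts = text.strip.split(RANGE_SEPARATOR, 2)
  parts.length == 2 ? parts : nil
end

p split_range("May 22 2014 to June 3 2015")   # ["May 22 2014", "June 3 2015"]
p split_range("June 9 -- August first week")  # ["June 9", "August first week"]
```

Each half could then be passed to Chronic.parse on its own, as in the earlier example.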

natesire commented 10 years ago

I see. Tokenization should work. Currently, my algorithm reads the sentence forward from index 0 until chronic returns nil. Then it reads the sentence backwards until it reaches the previous nil point. I'll check how well tokenization can give me just the two dates.

natesire commented 10 years ago

Here's an example I am running into with chronic:

'Jan first week' is nil
'Jan first' is valid in chronic
'Jan' isn't valid, chronic returns 2015-01-16 12:00:00 -0500

So your idea is to erase 'week' and leave 'first', using tokenization?

natesire commented 10 years ago

I wrote a test in Python.

Here is the output:

[('Available', 'JJ'), ('June', 'NNP'), ('9', 'CD'), ('--', ':'), ('August', 'NNP'), ('first', 'JJ'), ('week', 'NN')]
['June', '9', 'August']
June 9 August

import re

import nltk

# tokenize
sentence = 'Available June 9 -- August first week'
tokens = nltk.word_tokenize(sentence)

parts_of_speech = nltk.pos_tag(tokens)
print(parts_of_speech)

# keep only proper nouns (NNP) and cardinal numbers (CD)
approved_tags = ['NNP', 'CD']
filtered = []
for word, tag in parts_of_speech:
    if any(x in tag for x in approved_tags):
        filtered.append(word)

print(filtered)

# normalize: replace any run of non-alphanumeric characters that follows whitespace
normalized = re.sub(r'\s\W+', ' ', ' '.join(filtered))
print(normalized)

natesire commented 10 years ago

I can write a white-list function for words like 'first'. I am really liking this solution. Great idea to tokenize. Now I need a different excuse to write something in Scala. hahahahaha
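A white-list filter along these lines could look like the sketch below. It is written in plain Ruby with lookup tables instead of NLTK part-of-speech tags, so it is a swapped-in technique and not the function described above; the name `date_tokens` and the contents of `WHITELIST` are assumptions for illustration only.

```ruby
require "date"

# Full and abbreviated English month names from the standard library, lowercased.
MONTHS = (Date::MONTHNAMES.compact + Date::ABBR_MONTHNAMES.compact).map(&:downcase).freeze

# Extra words to keep even though they are neither months nor numbers (assumed list).
WHITELIST = %w[first second third fourth last].freeze

# Hypothetical helper: keep only numbers, month names, and white-listed words.
def date_tokens(sentence)
  sentence.scan(/[[:alnum:]]+/).select do |word|
    w = word.downcase
    w.match?(/\A\d+\z/) || MONTHS.include?(w) || WHITELIST.include?(w)
  end
end

p date_tokens("Available June 9 -- August first week")
# ["June", "9", "August", "first"]
```

Unlike the tag-based filter, this keeps 'first' and drops 'week', which matches the 'Jan first' case that chronic accepts.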

StephenOTT commented 10 years ago

> Here's an example I am running into with chronic. 'Jan first week' is nil 'Jan first' is valid in chronic 'Jan' isn't valid, chronic returns 2015-01-16 12:00:00 -0500
>
> So your idea is to erase 'week' and leave 'first', using tokenization?

For examples like this I would make assumptions about the date formats. For example, if someone writes "Jan First Week", use NLP to grab the month and infer that they want week 1. Then use the Ruby date library to get day 1 and day 7 of week 1.

StephenOTT commented 10 years ago

Take a look at this for an example of grabbing the date of a day number in a week number: http://www.ruby-doc.org/stdlib-2.1.1/libdoc/date/rdoc/Date.html#method-c-commercial

Then use the time_difference library to get the duration.
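A minimal sketch of that suggestion, using only Ruby's standard `date` library: `Date.commercial(year, week, day)` returns the date for an ISO 8601 week number and weekday (1 is Monday, 7 is Sunday). The helper name `iso_week_range` is made up for this example, and plain date subtraction stands in for time_difference here so the snippet needs no gems.

```ruby
require "date"

# Hypothetical helper: first and last day of an ISO 8601 week.
def iso_week_range(year, week)
  first_day = Date.commercial(year, week, 1) # Monday of that week
  last_day  = Date.commercial(year, week, 7) # Sunday of that week
  [first_day, last_day]
end

first_day, last_day = iso_week_range(2015, 1)
puts first_day                     # 2014-12-29
puts last_day                      # 2015-01-04
puts((last_day - first_day).to_i)  # 6 (days between the two dates)
```

Note that ISO week 1 of 2015 starts on 29 December 2014, so "week 1" in this sense is not always the same as "the first week of January". Whether that matches what the writer meant is one of the assumptions you would have to make.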

natesire commented 10 years ago

I wrote a white-list function. Python is handling things beautifully. I can feed the output into chronic, and I can call the Python script from Ruby. Let me know if chronic needs contributions.
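One way to call a Python script from Ruby is the standard library's `Open3.capture2`, sketched below. The script name `tokenize_dates.py`, its command-line interface (sentence as the first argument, one cleaned-up date phrase per output line), and the helper name `run_tokenizer` are all assumptions for illustration; the actual script is not shown in this thread.

```ruby
require "open3"

# Hypothetical command: assumes a script that takes the sentence as its
# first argument and prints one cleaned-up date phrase per line.
DEFAULT_COMMAND = ["python3", "tokenize_dates.py"].freeze

# Runs the external tokenizer and returns its non-empty output lines.
def run_tokenizer(sentence, command: DEFAULT_COMMAND)
  stdout, status = Open3.capture2(*command, sentence)
  raise "tokenizer failed with status #{status.exitstatus}" unless status.success?

  stdout.lines.map(&:strip).reject(&:empty?)
end
```

Each returned phrase could then be passed to Chronic.parse. Passing the command as separate arguments avoids the shell, so characters like & or -- in the sentence are not interpreted.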

StephenOTT commented 10 years ago

great