spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License

"to" is a preposition and not a conjunction #1107

Open NikhilVerma opened 5 months ago

NikhilVerma commented 5 months ago

https://www.dictionary.com/browse/to

I am trying to build a sentence separator which can split a sentence if it has multiple verb or noun conjunctions.

My current approach is something like this:

    const conjunctionSplit = doc
        .splitOn("#Adverb? #Verb (#Conjunction|,)")
        .splitOn("(#Conjunction|,) #Adverb? #Verb");

However, a sentence like "An organisation should make best efforts to protect it's hardware and software." gets parsed as:

[
    "An organisation should make best efforts",
    "to protect",
    "it's",
    "hardware and",
    "software."
]

but it should be parsed as:

[
    "An organisation should make best efforts to protect it's",
    "hardware and",
    "software."
]

My current workaround is to do this:

    // assumes `world` is compromise's world object, e.g. from nlp.world()
    world.model.one.lexicon.to = "Preposition";
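
To show why the tag on "to" matters for the split, here is a toy sketch in plain JavaScript. It is not compromise's tagger or match engine: `toyLexicon`, `tag`, and `splitBeforeConjunctionVerb` are hypothetical names, the tags are assigned by hand, and the split rule only loosely imitates the `(#Conjunction|,) #Adverb? #Verb` pattern (no comma or adverb handling).

```javascript
// Toy lexicon: hand-assigned tags, for illustration only (not compromise's data).
const toyLexicon = {
  an: 'Determiner', organisation: 'Noun', should: 'Verb', make: 'Verb',
  best: 'Adjective', efforts: 'Noun', to: 'Conjunction', protect: 'Verb',
  "it's": 'Possessive', hardware: 'Noun', and: 'Conjunction', software: 'Noun',
};

// Tag each word by lexicon lookup; unknown words default to Noun.
function tag(text, lexicon) {
  return text.split(/\s+/).map((word) => {
    const key = word.toLowerCase().replace(/[.,]$/, '');
    return { word, tag: lexicon[key] || 'Noun' };
  });
}

// Start a new chunk before any Conjunction that is directly followed by a Verb.
function splitBeforeConjunctionVerb(terms) {
  const chunks = [[]];
  terms.forEach((term, i) => {
    const next = terms[i + 1];
    const boundary = term.tag === 'Conjunction' && next && next.tag === 'Verb';
    if (boundary && chunks[chunks.length - 1].length > 0) chunks.push([]);
    chunks[chunks.length - 1].push(term.word);
  });
  return chunks.map((c) => c.join(' '));
}

const sentence =
  "An organisation should make best efforts to protect it's hardware and software.";

// "to" tagged as Conjunction: the sentence breaks before "to protect".
const asConjunction = splitBeforeConjunctionVerb(tag(sentence, toyLexicon));

// "to" overridden to Preposition: no break at "to protect".
const asPreposition = splitBeforeConjunctionVerb(
  tag(sentence, { ...toyLexicon, to: 'Preposition' })
);

console.log(asConjunction);
console.log(asPreposition);
```

In this toy version, the only thing that changes between the two runs is the lexicon entry for "to", which mirrors what the one-line override does.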

It's awesome that compromise lets me edit the lexicon so easily, but I think this should be updated in the main library as well.

spencermountain commented 5 months ago

hey Nikhil, yep, you're right - looks like a mis-tagging by compromise in this case. I'm happy to check it out for the next release. Thanks for the heads-up, cheers

spencermountain commented 3 months ago

hey, longer answer this time: the Penn Tagset has its own separate part-of-speech tag for TO, which I think is why it became a Conjunction in the test-set I used, and why we call it a conjunction by default in compromise. I changed it just now, and a billion tests failed, so this change should probably go in a major release.

Personally, I've never been clear on the difference - 'head and tail' vs 'head to tail'. I'd love to know if you (or anyone!) have opinions on this, of any strength - they both seem to do the same thing, to me.

gonna punt this for now. Thank you for flagging it to me, cheers