spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License
11.43k stars 656 forks

Feature: .slashes() tokenize transform #1100

Closed retorquere closed 2 months ago

retorquere commented 6 months ago

I'm tokenizing using compromise/one. Can I have 'IEEE/WIC/ACM' be recognized as 3 slash-separated words rather than one?

spencermountain commented 6 months ago

hey Emiliano, good idea. I've changed the tokenizer to allow for 3 slashes per word.

nlp(`IEEE/WIC/ACM`).match('wic').found //true

note that the words are not actually split. we have an awkward, but safe interpretation of slashes, so they don't get bunged-up by other transformations.

cheers

spencermountain commented 6 months ago

released in 14.13.0, thanks

retorquere commented 6 months ago

on 14.13 I still see

const nlp = require('compromise/one')
const doc = nlp('IEEE/WIC/ACM')

for (const sentence of doc.json({offset:true})) {
  for (const term of sentence.terms) console.log(term)
}

showing one token with the combined words. How do I extract the separate words in 14.13?

spencermountain commented 6 months ago

Hey, sorry for delay - yes, this is not possible now, but is a good idea. The thinking was that slashed words should be one term, for most purposes, but should be able to be accessed individually with matches and things. You can see the slashed words are tokenized in the .json() response in an 'alias' property.

It would be cool (and possible) to add a .slashes().split() method. I can try to add it in an upcoming release. Cheers

retorquere commented 6 months ago

I find them there, but they've been lowercased. I use the tokenizer for a sentence-casing algorithm so I need case intact.

retorquere commented 6 months ago

Would the split method recreate location info? And this slashes.split would be something I would run on individual terms?
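To make the location question concrete, here is a minimal sketch of the behaviour being asked about: splitting one term's text on slashes, keeping the original case, and recomputing a character offset for each piece. `splitSlashedTerm` and the `{ text, start, length }` shape are hypothetical names chosen for illustration. They are not part of compromise and say nothing about how .slashes().split() behaves.

```javascript
// Hypothetical helper, not part of compromise: split one term's text on "/"
// and compute a character offset for every piece. Case is left untouched.
function splitSlashedTerm(text, start = 0) {
  const pieces = []
  let cursor = 0 // position inside `text` where the current piece begins
  for (const part of text.split('/')) {
    if (part.length > 0) {
      pieces.push({ text: part, start: start + cursor, length: part.length })
    }
    cursor += part.length + 1 // move past this piece and the slash after it
  }
  return pieces
}

// A term that begins at character 10 of its source string.
console.log(splitSlashedTerm('IEEE/WIC/ACM', 10))
```

Empty pieces (from a leading, trailing, or doubled slash) are dropped, but the cursor still advances past them so the later offsets stay correct.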

spencermountain commented 2 months ago

hey Emiliano, I've added .slashes().split() method in 14.14.0 - it seems to be working well - let me know if you find any issues with it. cheers

retorquere commented 1 month ago

I'll give it a go, but I've implemented something else in the meantime that works reasonably well.

retorquere commented 1 month ago

How do I use this? I've lifted this from the tests:

let doc = nlp(`i saw him/her yesterday at 2pm.`)
let m = doc.slashes()

but that gets me

doc.slashes is not a function

retorquere commented 1 month ago

Ah wait, this doesn't work on `compromise/one`?

spencermountain commented 1 month ago

Slashes is a recent feature, please ensure you’ve updated to the latest version. Cheers

On Fri, Aug 30, 2024 at 3:51 AM Emiliano Heyns @.***> wrote:

Ah wait, this doesn't work on `compromise/one`?


retorquere commented 1 month ago

I'm on 14.14.0. I don't really get yet how slashes should be used but looking at the tests I assumed this should at least be valid code:

import nlp from 'compromise/one'
const doc = nlp('The multi-part grain-based formulae-as-types notion of construction more/than/this v0.9')
let m = doc.slashes()

but that gets me

doc.slashes is not a function

whereas this does pass:

import nlp from 'compromise'
const doc = nlp('The multi-part grain-based formulae-as-types notion of construction more/than/this v0.9')
let m = doc.slashes()

retorquere commented 1 month ago

It looks like .slashes extracts the terms that have slashes, but I am looking for a way to extract all terms of a sentence, with the slashes split. Is that possible?

spencermountain commented 1 month ago

hey Emiliano, yep, the .slashes() command is in compromise/three, so it will not appear in compromise/one. I know it's not clear, but everything under the compromise/three header is part of that export.

If you needed it to work in compromise/one, you may have some luck plucking it as a plugin, from here. Cheers

retorquere commented 1 month ago

It looks like .slashes extracts the terms that have slashes, but I am looking for a way to extract all terms of a sentence, with the slashes split. Is that possible?

but is this possible with .slashes? I don't know how to add a plugin to compromise, and if it won't do what I need, there's no need to figure it out.
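For reference, the result I am after can be approximated outside the library. This is only a sketch under assumptions: `termsWithSlashesSplit` is a hypothetical helper, not a compromise method, and it assumes the input has the same shape as the .json() loop earlier in this thread (an array of sentences, each with a `terms` array whose entries carry a `text` string). The sample data is hand-written so the snippet runs without compromise installed.

```javascript
// Hypothetical helper, not part of compromise. Assumes `sentences` looks like
// [{ terms: [{ text: '...' }, ...] }, ...], as in the .json() loop above.
function termsWithSlashesSplit(sentences) {
  return sentences.map(sentence =>
    // Every term contributes one or more words; empty pieces are discarded.
    sentence.terms.flatMap(term => term.text.split('/').filter(word => word.length > 0))
  )
}

// Hand-written stand-in for the tokenizer output (an assumption, not real output).
const sampleSentences = [
  { terms: [{ text: 'construction' }, { text: 'more/than/this' }, { text: 'v0.9' }] },
]
console.log(termsWithSlashesSplit(sampleSentences))
```

This splits on every slash it sees, so terms such as fractions or paths would be broken up as well. A real version would need a rule for which slashed terms to leave alone.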