Closed retorquere closed 2 months ago
hey Emiliano, good idea. I've changed the tokenizer to allow for 3 slashes per word.
nlp(`IEEE/WIC/ACM`).match('wic').found //true
note that the words are not actually split. we have an awkward, but safe interpretation of slashes, so they don't get bunged-up by other transformations.
cheers
released in 14.13.0, thanks
on 14.13 I still see
const nlp = require('compromise/one')
const doc = nlp('IEEE/WIC/ACM')
for (const sentence of doc.json({ offset: true })) {
  for (const term of sentence.terms) console.log(term)
}
showing one token with the combined words. How do I extract the separate words in 14.13?
Hey, sorry for the delay - yes, this is not possible now, but it is a good idea. The thinking was that slashed words should be one term for most purposes, but should still be accessible individually with matches and things. You can see the slashed words are tokenized in the .json() response, in an 'alias' property.
It would be cool (and possible) to add a .slashes().split() method. I can try to add it in an upcoming release. Cheers
I find them there, but they've been lowercased. I use the tokenizer for a sentence-casing algorithm so I need case intact.
Would the split method recreate location info? And this slashes.split would be something I would run on individual terms?
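For reference, the post-processing described here (splitting a slashed term while keeping its case and recomputing its location) can be sketched in plain JavaScript. This is only a sketch and not part of compromise: `splitSlashedTerm` and `splitAllTerms` are hypothetical helper names, and the term shape `{ text, offset: { start, length } }` is an assumption about what `.json({ offset: true })` returns.

```javascript
// Hypothetical helper, not a compromise API: split one term on '/'
// while keeping the original casing and recomputing character offsets.
// Assumes a term shaped like { text, offset: { start, length } }.
function splitSlashedTerm(term) {
  const parts = []
  let cursor = term.offset.start
  for (const piece of term.text.split('/')) {
    // skip empty pieces produced by leading, trailing or doubled slashes
    if (piece.length > 0) {
      parts.push({ text: piece, offset: { start: cursor, length: piece.length } })
    }
    cursor += piece.length + 1 // move past this piece and the slash after it
  }
  return parts
}

// Expand every term of every sentence into its slash-separated pieces.
// Terms without a slash come back as a single piece with unchanged offsets.
function splitAllTerms(sentences) {
  return sentences.flatMap(sentence => sentence.terms.flatMap(term => splitSlashedTerm(term)))
}

// Hand-written input in the assumed shape, standing in for doc.json({ offset: true })
const sentences = [
  { terms: [{ text: 'IEEE/WIC/ACM', offset: { start: 0, length: 12 } }] },
]
console.log(splitAllTerms(sentences))
```

Because the pieces are cut from the term's `text` rather than from a normalized form, the casing is left untouched. Note that this splits every slash, so fractions or dates such as 1/2 would be split as well.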
hey Emiliano, I've added a .slashes().split() method in 14.14.0 - it seems to be working well - let me know if you find any issues with it.
cheers
I'll give it a go, but I've implemented something else in the meantime that works reasonably well.
How do I use this? I've lifted this from the tests:
let doc = nlp(`i saw him/her yesterday at 2pm.`)
let m = doc.slashes()
but that gets me
doc.slashes is not a function
Ah wait, this doesn't work on `compromise/one`?
Slashes is a recent feature, please ensure you’ve updated to the latest version. Cheers
I'm on 14.14.0. I don't really get yet how slashes should be used, but looking at the tests I assumed this should at least be valid code:
import nlp from 'compromise/one'
const doc = nlp('The multi-part grain-based formulae-as-types notion of construction more/than/this v0.9')
let m = doc.slashes()
but that gets me
doc.slashes is not a function
whereas this does pass:
import nlp from 'compromise'
const doc = nlp('The multi-part grain-based formulae-as-types notion of construction more/than/this v0.9')
let m = doc.slashes()
It looks like .slashes extracts the terms that have slashes, but I am looking for a way to extract all terms of a sentence, with the slashes split. Is that possible?
hey Emiliano, yep - the .slashes() command is in compromise/three, so it will not appear in compromise/one.
I know it's not clear, but everything under the compromise/three header is part of that export.
If you need it to work in compromise/one, you may have some luck plucking it as a plugin, from here. Cheers
> It looks like .slashes extracts the terms that have slashes, but I am looking for a way to extract all terms of a sentence, with the slashes split. Is that possible?
but is this possible with .slashes? I don't know how to add a plugin to compromise, and if it won't do what I need, there's no need to figure it out.
I'm tokenizing using compromise/one. Can I have 'IEEE/WIC/ACM' be recognized as 3 slash-separated words rather than one?