nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection library that works out-of-the-box.
MIT License

node.js port ? #94

Open chopinml opened 3 years ago

chopinml commented 3 years ago

Hello @nipunsadvilkar ,

Thank you for your effort in porting the Ruby library to Python.

Do you see any benefit in porting it to JavaScript (Node.js) as well? I also have three questions:

1) Is the main Ruby Pragmatic Segmenter repository still updated frequently?
2) Do you still watch the main Ruby repository and port its changes to your version?
3) Has this library diverged from the Ruby regex rules because of issues that people reported on this Python repo but not on the Ruby repo?

Congrats on your effort!

nipunsadvilkar commented 3 years ago

Hi @chopinml, thanks for the kind words.

I'm not aware of any JS port of it; you may want to check the Ruby version's README for that.

As you must have seen, pySBD is heavily inspired by pragmatic_segmenter, so all of its regex rules are implemented in Python as well. There could be differences in how they behave, because the regex engines of Python and Ruby differ.
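To illustrate the kind of engine difference that matters when porting rules, here is a small Python sketch. It is a general example and not one of pySBD's actual rules: in Ruby, `^` matches at the start of every line by default and the `/m` modifier makes `.` match newlines, whereas Python needs `re.MULTILINE` and `re.DOTALL` for the same behavior.

```python
import re

text = "First line.\nSecond line."

# Python default: ^ matches only at the very start of the string.
default_matches = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches after each newline,
# which corresponds to how ^ behaves in Ruby without any flag.
multiline_matches = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Python default: "." does not match a newline.
plain_dot = re.search(r"line\..Second", text)

# re.DOTALL lets "." match a newline (Ruby spells this flag /m).
dotall_dot = re.search(r"line\..Second", text, flags=re.DOTALL)

print(default_matches)
print(multiline_matches)
print(plain_dot is None, dotall_dot is not None)
```

A rule copied verbatim from Ruby can therefore match fewer positions in Python unless the corresponding flags are added.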

In addition to the existing functionality of pragmatic_segmenter, I've added a few features that are useful to the NLP community, such as getting the character offsets of sentences. You may not find such features in other implementations.
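A minimal sketch of the character-offset feature, assuming pySBD is installed and using the `char_span=True` option of `pysbd.Segmenter`, which returns span objects with `sent`, `start`, and `end` attributes. The helper names `sentence_offsets` and `make_segmenter` are hypothetical and only serve the illustration.

```python
def sentence_offsets(text, segmenter):
    """Return (start, end, sentence) tuples for each sentence in `text`.

    `segmenter` is expected to behave like pysbd.Segmenter(char_span=True):
    its segment() method returns objects with .sent, .start and .end.
    """
    return [(span.start, span.end, span.sent) for span in segmenter.segment(text)]


def make_segmenter(language="en"):
    """Build a pySBD segmenter that reports character offsets."""
    import pysbd  # third-party; imported lazily so the helper above stays dependency-free

    # clean=False leaves the input text unmodified, so offsets refer to the original string.
    return pysbd.Segmenter(language=language, clean=False, char_span=True)
```

With pySBD installed, `sentence_offsets(text, make_segmenter())` should yield spans for which `text[start:end]` equals the returned sentence.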

Further pySBD development happens on a rolling basis: as new issues/PRs are submitted, I review them and get them merged into a subsequent release. Such issues/PRs may or may not relate to functionality that pySBD lacks but that exists in other implementations.

Hope the above answers help! Thanks

chopinml commented 3 years ago

I haven't seen any JS port. I first found the Ruby repo, then found yours and the C# port through its README 😃

I guess the NLP community relies heavily on Python, so Node.js libraries are not very common for NLP tasks. I'm actually trying to build a bilingual corpus from the web, so I realized I will need a sentence aligner and started browsing GitHub.

Do you have any experience with Node.js? Is it faster than Python for web crawling and text operations?

ghost commented 1 year ago

@chopinml you can always run Python as a child process from Node.js and capture its output. I have a similar setup for another project:

# ftfy-wrapper.py
import sys

import ftfy

# The text to fix arrives as the first command-line argument.
text = sys.argv[1]
clean = ftfy.fix_text(text)
print(clean)
sys.stdout.flush()

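This is my Node.js code:

import { spawn } from 'node:child_process';
import path from 'node:path';
import { fileURLToPath } from 'node:url';

const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);
const arg_ftfy = path.join(__dirname, './ftfy-wrapper.py');
// Running the venv's own interpreter takes the place of "activating" the venv:
// an activate script has no effect on a separately spawned process.
// This is the Windows venv layout; on Linux/macOS it would be .venv/bin/python.
const venvPython = 'python-ftfy/.venv/Scripts/python.exe';

export function mojibakeFixer(arg_text) {
    return new Promise((resolve, reject) => {
        const ftfyApp = spawn(venvPython, [arg_ftfy, arg_text], {
            // Extend the parent environment instead of replacing it.
            env: { ...process.env, PYTHONIOENCODING: 'utf8' },
        });
        let out = '';
        let err = '';
        ftfyApp.stdout.setEncoding('utf8');
        ftfyApp.stderr.setEncoding('utf8');
        // Output may arrive in several chunks, so collect it until the process ends.
        ftfyApp.stdout.on('data', (data) => { out += data; });
        ftfyApp.stderr.on('data', (data) => { err += data; });
        ftfyApp.on('error', reject);
        ftfyApp.on('close', (code) => {
            if (code === 0) resolve(out);
            else reject(new Error(`child process exited with code ${code}: ${err}`));
        });
    });
}

Making pySBD runnable from a simple script like my ftfy-wrapper.py is a bit of a challenge, I guess, since it has many dependencies, so you need to create a virtual environment for it and run the wrapper with that environment's Python interpreter, as above.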
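A pySBD counterpart to ftfy-wrapper.py might look like the sketch below. The file name `pysbd-wrapper.py`, the `--lang` flag, and the helper `sentences_as_json` are assumptions made for illustration. It reads the text from stdin rather than from `sys.argv`, which avoids command-line length limits on long documents, and prints the sentences as a JSON array so the Node.js side can parse them with `JSON.parse`.

```python
# pysbd-wrapper.py (hypothetical name, mirroring ftfy-wrapper.py)
import json
import sys


def sentences_as_json(text, segment):
    """Serialize the sentences produced by `segment(text)` as a JSON array."""
    # ensure_ascii=False keeps non-ASCII characters readable in the output.
    return json.dumps(list(segment(text)), ensure_ascii=False)


def main(language):
    import pysbd  # third-party; must be installed in the environment running this script

    # clean=False keeps the original text intact; segment() then returns plain strings.
    segmenter = pysbd.Segmenter(language=language, clean=False)
    text = sys.stdin.read()
    sys.stdout.write(sentences_as_json(text, segmenter.segment))
    sys.stdout.flush()


# Minimal argument handling for the sketch; argparse would be more robust.
if __name__ == "__main__" and len(sys.argv) == 3 and sys.argv[1] == "--lang":
    main(sys.argv[2])
```

It would be invoked as `python pysbd-wrapper.py --lang en < input.txt`, or from Node.js by writing the text to the child process's stdin.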

I hope this helps.