kmadathil / sanskrit_parser

Parsers for Sanskrit / संस्कृतम्
MIT License

Tokenization #91

Closed codito closed 5 years ago

codito commented 6 years ago

In Level 2, should the sentence be tokenized first before lexical splits? E.g. say with whitespace (to start with) as a token boundary.

If we tokenize the sentence first, maybe the following output paths could be pruned (see below for the complete output):

[kaH, cit, naraH, vAna, arI]
[kaH, cit, na, raH, vAna, arI]

Current output

> python -m sanskrit_parser.lexical_analyzer.sanskrit_lexical_analyzer "kaScit naraH vA nArI" --debug --split
Input String: kaScit naraH vA nArI                                                                                                                             
Input String in SLP1: kaScit naraH vA nArI                                                                                                                     
Start Split
End DAG generation                                                                                                                                             
End pathfinding 1527393212.680358                                                                                                                              
Splits:
[kaH, cit, naraH, vAna, arI]                                                                                                                                   
[kaH, cit, naraH, vAH, nArI]                                                                                                                                   
[kaH, cit, naraH, vA, nArI]                                                                                                                                    
[kaH, cit, na, raH, vAna, arI]                                                                                                                                 
[kaH, cit, naraH, vAH, na, arI]                                                                                                                                
[kaH, cit, naraH, vA, na, arI]                                                                                                                                 
[kaH, cit, na, raH, vAH, nArI]                                                                                                                                 
[kaH, cit, naraH, vA, AnA, arI]                                                                                                                                
[kaH, cit, na, raH, vA, nArI]                                                                                                                                  
[kaH, cit, naraH, vA, A, nArI]
-----------
Performance
Time for graph generation = 0.024774s
Total time for graph generation + find paths = 0.032885s
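The tokenize-first idea could be prototyped along these lines — a hypothetical sketch (split_with_boundaries and toy_split are stand-ins, not sanskrit_parser APIs) of splitting the sentence on whitespace first and combining per-chunk lexical splits, so that no segment can straddle a space:

```python
from itertools import product

def split_with_boundaries(sentence, split_word):
    # Treat whitespace as a hard pada boundary: split each space-free
    # chunk independently, then combine one candidate per chunk.
    per_chunk = [split_word(chunk) for chunk in sentence.split()]
    return [sum(combo, []) for combo in product(*per_chunk)]

# Toy single-candidate splitter for demonstration (not real sandhi splitting).
def toy_split(chunk):
    return [[chunk]]

splits = split_with_boundaries("kaScit naraH vA nArI", toy_split)
# With a real per-chunk splitter, paths like [kaH, cit, naraH, vAna, arI]
# could not arise, because "vAna" would have to straddle the space in "vA nArI".
```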
kmadathil commented 6 years ago

We should. This was done in SanskritLexicalAnalyzer at one point, but we may have inadvertently lost it when we moved towards generating all splits at once.

I still see the code which used to do this; it probably doesn't work right now?

        spos = s.find(' ')
        if spos != -1:
            # Replace the first space only
            s = s.replace(' ', '', 1)

        s_c_list = _sandhi_splits_all(s, start=0, stop=spos + 1)
        logger.debug("s_c_list: " + str(s_c_list))
codito commented 6 years ago

I will debug this further.

avinashvarna commented 6 years ago

@kmadathil This code should still work. I remember looking at this when I was making the graph optimizations. However, it won't apply to the example that @codito gave. "vA nArI" will have the first space replaced and the input to the sandhi splitter would be "vAnArI" for which "vAna arI" is a valid split.

@codito Can you please elaborate on what sort of tokenization you have in mind? I don't think splitting at spaces in the input will help, because of cases like "ta eva" which would have to be split as "te eva".

The scoring based approach to reorder the results (and perhaps later apply a threshold to remove unlikely splits) would probably work better. Let me test and get back.

codito commented 6 years ago

@avinashvarna I had a whitespace-based split in mind, assuming two aspects of user intent:

  1. Lexical analysis should split words wherever possible. Whitespace between words would mean that the user has already intentionally provided a boundary.
  2. And (1) may not include generating combinations across boundaries (combining words across the whitespace and then splitting them).

I am not sure if (2) is a valid grammatical assumption in Sanskrit. E.g., in English this may not be valid: going rain => go-ing rain => go ingrain. Is this derivation allowed in Sanskrit: ta eva => taeva => te eva?

Tried a simple code change (commit) for the whitespace split. It seems to generate the expected output.

> python -m sanskrit_parser.lexical_analyzer.sanskrit_lexical_analyzer "ta eva" --debug --split
Input String: ta eva
Input String in SLP1: ta eva
Start Split
End DAG generation
End pathfinding 1527648553.8241742
Splits:
[ta, eva]
[te, eva]
[taH, eva]
[ta, e, va]
[te, e, va]
[taH, e, va]
-----------
Performance
Time for graph generation = 0.026064s
Total time for graph generation + find paths = 0.028546s

Logs

DEBUG:sanskrit_parser.lexical_analyzer.sandhi:Trying after ta ev
DEBUG:sanskrit_parser.lexical_analyzer.sandhi:Found split ta ev -> ('te', 'ev') (ach_sandhi.txt:47)
DEBUG:sanskrit_parser.lexical_analyzer.sandhi:Found split ta ev -> ('to', 'ev') (ach_sandhi.txt:49)
DEBUG:sanskrit_parser.lexical_analyzer.sandhi:Trying after ta eva
DEBUG:sanskrit_parser.lexical_analyzer.sandhi:Split: ta eva, 1
avinashvarna commented 6 years ago

If we agree that (1) is a reasonable assumption/restriction, then along with your code change, the sandhi splitting/joining rules could be modified to only split at places where typically spaces are written even after sandhi, such as after what would otherwise be a visarga/ma-kAra (loosely speaking).

The problem is that some people (including @vvasuki at one point in time) have taken to writing sandhi forms with spaces in the middle. E.g. इत्य् अपि (ity api). These would not be correctly recognized if we enforce the rule that a space denotes a word boundary.

We could always add another option to choose between these two behaviors, but the code complexity will go up.

vvasuki commented 6 years ago

(Technically, the whitespace in इत्य् अपि still demarcates the word boundary - इत्य् has the pada-संज्ञा even after इ->य् substitution.)

codito commented 6 years ago

Google also suggests usage of इत्य् in several texts. Isn't this a case of a missing entry for इत्य् in the data sources (inria, sanskrit_util)?

avinashvarna commented 6 years ago

@vvasuki I think the पदसंज्ञा is not the same as what would generally be called a word boundary, but that is a tangential discussion.

@codito इत्य् in this context is not a separate form, but rather the result of applying यण्-सन्धि written with a space between the constituent words.

I think the summary is that we cannot rely on spaces. Does anyone disagree?

codito commented 6 years ago

@avinashvarna the evidence so far suggests not relying on spaces.

(My Sanskrit knowledge is limited; I'm trying to learn more.) Can someone help find whether there are any references to whitespace in our grammar literature? Why would an author ever use whitespace in a Sanskrit sentence? :)

Are there any specific subsets of sandhi rules that cases like इत्य् apply to?

If we don't have any rules around whitespace as such, I think it's fair to prefer not missing a possible split. We can ask users to split on whitespace before using sanskrit_parser if they prefer.

vvasuki commented 6 years ago

Can someone help find if there are any references to whitespace in our grammar literature?

Zilch. Sanskrit grammar says nothing about writing style.

Why would an author ever use whitespace in a sanskrit sentence? :)

Indeed, until about 200 years ago there were no whitespaces whatsoever within sentences/half-verses of Sanskrit texts :-) Even the avagraha was a rarity! Use of whitespace is an innovation that supposedly helps reading (dramatically increasing reading speed in European languages). In typical Sanskrit orthography, whitespaces started being inserted everywhere (roughly) except where non-visarga, non-anusvAra sandhi-s appear between full pada-s. In prose and texts, I prefer extending this to every possible sandhi (everything except cases where one letter is substituted for two). In fact, I often place hyphens within samAsa-s. Many folks used to the neo-traditional orthography find this ugly and unnecessary; but others such as I find it natural.

codito commented 6 years ago

Thanks! I will close this issue. Will add the use case to scoring discussion in #84. It appears to be the best way to resolve this.

kmadathil commented 6 years ago

I think we should look into this - I will try when I can pull myself out of all of the stuff going on around me ;-)

At some point, I'm sure we did this right. We treated space as a hard hint of a pada boundary - and therefore that a split must happen at that point. "kaScit naraH vA nArI" should not result in kaH, cit, naraH, vAna, arI as a valid option, but "kaScinnarovAnArI" can. I am almost certain this cropped up because the code I pointed out above stopped working as intended once we moved to generating all splits at once.

kmadathil commented 6 years ago

@avinashvarna ta eva should still work, since the space is to be interpreted as a hard hint of a split, not to prevent lookahead to decide what that split should be.

avinashvarna commented 6 years ago

I think we need to decide on how we want to treat spaces.

  1. Must split at space
  2. Only split at space
  3. Use space as a marker to not split beyond the next space
  4. Something else?

The code above currently does 3, but the comments seem to indicate a leaning towards a combination of 1 and 2.
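A hedged sketch of option 1 ("must split at space"): filter full-sentence candidate paths, keeping a path only if a segment boundary falls at every space in the input. The function name is hypothetical, and it assumes each segment covers the same number of characters as the corresponding span of the input, which sandhi does not always preserve:

```python
def must_split_at_spaces(input_str, path):
    # Positions of spaces, measured in non-space characters seen so far.
    space_positions, count = set(), 0
    for ch in input_str:
        if ch == ' ':
            space_positions.add(count)
        else:
            count += 1
    # Cumulative segment lengths give the candidate boundary positions.
    boundaries, total = set(), 0
    for seg in path[:-1]:
        total += len(seg)
        boundaries.add(total)
    # Every space must coincide with some segment boundary.
    return space_positions <= boundaries
```

For "kaScit naraH vA nArI" this rejects [kaH, cit, naraH, vAna, arI] (no boundary at the space after "vA") while keeping [kaH, cit, naraH, vA, nArI], and it still accepts [te, eva] for "ta eva".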

kmadathil commented 6 years ago

"Must split (of some kind) at space" was my intent.

Any cons to going that way?


avinashvarna commented 6 years ago

I can't think of any cons immediately. I can think of a relatively simple fix to enforce this in the sandhi module by removing any splits that result in len(left string) > position of space (which is passed as an input to the split method). I will try to implement and test this, and report back.
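The fix described above could look roughly like this — a hypothetical sketch, not the actual sandhi module API, of pruning (left, right) splits whose left part extends past the space position passed in:

```python
def prune_splits(splits, space_pos):
    # Discard any split whose left part extends past the original space.
    return [(left, right) for (left, right) in splits
            if len(left) <= space_pos]

# For the chunk "vAnArI" (originally "vA nArI", space at index 2):
candidates = [("vA", "nArI"), ("vAH", "nArI"), ("vAna", "arI")]
pruned = prune_splits(candidates, 2)
```

Note that sandhi can lengthen the left part (the candidate "vAH" is also pruned here), so a strict length comparison may over-prune; this is a simplification.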

avinashvarna commented 6 years ago

I looked into this a little bit. This seems to be due to the recursive nature of the calls and the space replacement and not directly related to the sandhi module. E.g. say input is "AB CD EF". The first space replacement will try to split "ABCD EF". Suppose a split ("A", "BCD EF") is found where "A" is valid. The program will recursively try to split "BCD EF" next. The first space replacement logic will then replace the first space and try to split "BCDEF". Effectively, the string has become almost equivalent to "ABCDEF".

We could try to build on @codito's commit and handle the spaces in the sandhi module. Needs some more thought.
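The interaction described above can be reproduced with a toy function (an illustration of the bug, not the parser's actual code): replacing the first space at each level of recursion eventually deletes every space:

```python
def replace_first_space_each_level(s):
    # Each recursive call drops one more space, so "AB CD EF"
    # effectively degenerates into "ABCDEF" by the deepest level.
    if ' ' not in s:
        return s
    return replace_first_space_each_level(s.replace(' ', '', 1))
```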

codito commented 6 years ago

I've pushed the commit to tokenize branch for future work.

kmadathil commented 5 years ago

Ok, I'm finally back in action (on a different continent and timezone). @codito - your fix seems to be correct. I had to make a minor addition to the sandhi rules to get strings like "harir iha" to work right. (@avinashvarna - I added a rule with a space)

kmadathil commented 5 years ago

Tests pass locally, but I see build errors on Travis-CI. @avinashvarna - any idea? This seems to be in sanskrit_util:

____ ERROR collecting tests/test_SanskritLexicalAnalyer.py ____
tests/test_SanskritLexicalAnalyer.py:9: in <module>
    from sanskrit_parser.lexical_analyzer.sanskrit_lexical_analyzer import SanskritLexicalAnalyzer
sanskrit_parser/lexical_analyzer/sanskrit_lexical_analyzer.py:10: in <module>
    from sanskrit_parser.util.lexical_lookup_factory import LexicalLookupFactory
sanskrit_parser/util/lexical_lookup_factory.py:8: in <module>
    from sanskrit_parser.util.sanskrit_data_wrapper import SanskritDataWrapper
sanskrit_parser/util/sanskrit_data_wrapper.py:11: in <module>
    import sanskrit_util.analyze
.tox/py36/lib/python3.6/site-packages/sanskrit_util/__init__.py:11: in <module>
    from .context import Context
.tox/py36/lib/python3.6/site-packages/sanskrit_util/context.py:20: in <module>
    from .schema import Base, EnumBase, GenderGroup
.tox/py36/lib/python3.6/site-packages/sanskrit_util/schema.py:357: in <module>
    class PrefixedModifiedRoot(PrefixedRoot, ModifiedRoot):
.tox/py36/lib/python3.6/site-packages/sqlalchemy/ext/declarative/api.py:65: in __init__
    _as_declarative(cls, classname, cls.__dict__)
.tox/py36/lib/python3.6/site-packages/sqlalchemy/ext/declarative/base.py:116: in _as_declarative
    _MapperConfig.setup_mapping(cls, classname, dict_)
.tox/py36/lib/python3.6/site-packages/sqlalchemy/ext/declarative/base.py:144: in setup_mapping
    cfg_cls(cls, classname, dict_)
.tox/py36/lib/python3.6/site-packages/sqlalchemy/ext/declarative/base.py:174: in __init__
    self._setup_inheritance()
.tox/py36/lib/python3.6/site-packages/sqlalchemy/ext/declarative/base.py:509: in _setup_inheritance
    "Class %s has multiple mapped bases: %r" % (cls, inherits))
E   sqlalchemy.exc.InvalidRequestError: Class <class 'sanskrit_util.schema.PrefixedModifiedRoot'> has multiple mapped bases: [<class 'sanskrit_util.schema.PrefixedRoot'>, <class 'sanskrit_util.schema.ModifiedRoot'>]

kmadathil commented 5 years ago

Workaround - forcing sqlalchemy to 1.0.11 fixes this. @avinashvarna , this will need a proper fix in sanskrit_util. Merging tokenize branch to master and closing this issue.
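For anyone hitting the same Travis failure, the workaround amounts to pinning the dependency (version taken from the comment above; the proper fix belongs in sanskrit_util):

```shell
pip install "sqlalchemy==1.0.11"
```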