utk-se / WorldSyntaxTree

Language-agnostic parsing of World of Code repositories

Further performance and RAM usage improvements: batch inserts #9

Closed robobenklein closed 3 years ago

robobenklein commented 3 years ago

Figured out how to do a constant-time batch write with only two queries! (The Python side is still O(N), but N is capped at the number of nodes in a single file, which are cheap to iterate.)
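For anyone reviewing, here is a rough sketch of what a two-query batch write can look like. It is not the code in this PR: the Cypher text, the `WSTNode` label, the `PARENT` relationship, and the property names (`file_id`, `preorder`, `parent`, `type`) are all assumptions made up for illustration. The only part taken from the actual worker is the calling pattern `tx.run(query, {"entries": entries})`.

```python
from typing import Any, Dict, List

# Hypothetical Cypher: label, relationship, and property names are
# illustrative assumptions, not the project's real schema.
CREATE_NODES = """
UNWIND $entries AS e
CREATE (n:WSTNode {file_id: e.file_id, preorder: e.preorder, type: e.type})
"""

LINK_PARENTS = """
UNWIND $entries AS e
WITH e WHERE e.parent IS NOT NULL
MATCH (c:WSTNode {file_id: e.file_id, preorder: e.preorder})
MATCH (p:WSTNode {file_id: e.file_id, preorder: e.parent})
CREATE (c)-[:PARENT]->(p)
"""


def batch_insert(tx: Any, entries: List[Dict[str, Any]]) -> None:
    """Write all nodes of one file using exactly two queries.

    `tx` is anything with a `run(query, parameters)` method, such as a
    neo4j transaction. The number of round trips does not grow with
    len(entries); only the parameter payload does.
    """
    # Query 1: create every node from the parameter list in one statement.
    tx.run(CREATE_NODES, {"entries": entries})
    # Query 2: connect each non-root node to its parent in one statement.
    tx.run(LINK_PARENTS, {"entries": entries})
```

Building `entries` in Python is still a loop over the file's nodes, which is the O(N) part mentioned above; the database only sees two statements per file regardless of N.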

Since inserts are now (theoretically) constant-time, this should be ready for analyzing all of GitHub. Please compare results, both in performance and in tree correctness.

I also added the preorder property, which is the node's sequence number in the pre-order depth-first traversal of the tree. This should allow us to develop a test suite that compares Tree Sitter trees from TreeSitterCursorIterator directly against the data in neo4j (by item-to-item sequence comparison).
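A minimal sketch of the kind of comparison this enables, under stated assumptions: the `Node` class below is a stand-in for a parsed tree (it is not the Tree Sitter or TreeSitterCursorIterator API), numbering from 0 is an arbitrary choice, and `first_mismatch` is a hypothetical helper name. In a real test, one sequence would come from the parser and the other from a query ordered by the preorder property.

```python
from dataclasses import dataclass, field
from itertools import zip_longest
from typing import Iterable, Iterator, List, Optional, Tuple


@dataclass
class Node:
    """Stand-in tree node; not the Tree Sitter node type."""
    type: str
    children: List["Node"] = field(default_factory=list)


def preorder_items(root: Node) -> Iterator[Tuple[int, str]]:
    """Yield (preorder number, node type) in depth-first pre-order."""
    stack = [root]
    counter = 0  # assumption: numbering starts at 0
    while stack:
        node = stack.pop()
        yield counter, node.type
        counter += 1
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))


def first_mismatch(expected: Iterable, actual: Iterable) -> Optional[int]:
    """Return the index of the first differing item, or None if equal."""
    for i, (a, b) in enumerate(zip_longest(expected, actual)):
        if a != b:
            return i
    return None
```

Because both sides are flattened to the same ordered sequence, a missing or extra node shows up as a mismatch at a specific index instead of requiring a structural tree diff.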

robobenklein commented 3 years ago
  File "/home/robo/code/WorldSyntaxTree/wsyntree_collector/neo4j_collector_worker.py", line 67, in batch_insert_WSTNode
    nresults = tx.run(qi, {"entries": entries})
  File "/home/robo/code/WorldSyntaxTree/venv/lib/python3.8/site-packages/neo4j_driver-4.1.1-py3.8.egg/neo4j/work/transaction.py", line 118, in run
    result._tx_ready_run(query, parameters, **kwparameters)
  File "/home/robo/code/WorldSyntaxTree/venv/lib/python3.8/site-packages/neo4j_driver-4.1.1-py3.8.egg/neo4j/work/result.py", line 57, in _tx_ready_run
    self._run(query, parameters, None, None, None, **kwparameters)
  File "/home/robo/code/WorldSyntaxTree/venv/lib/python3.8/site-packages/neo4j_driver-4.1.1-py3.8.egg/neo4j/work/result.py", line 101, in _run
    self._attach()
  File "/home/robo/code/WorldSyntaxTree/venv/lib/python3.8/site-packages/neo4j_driver-4.1.1-py3.8.egg/neo4j/work/result.py", line 202, in _attach
    self._connection.fetch_message()
  File "/home/robo/code/WorldSyntaxTree/venv/lib/python3.8/site-packages/neo4j_driver-4.1.1-py3.8.egg/neo4j/io/_bolt4x1.py", line 353, in fetch_message
    response.on_failure(summary_metadata or {})
  File "/home/robo/code/WorldSyntaxTree/venv/lib/python3.8/site-packages/neo4j_driver-4.1.1-py3.8.egg/neo4j/io/_bolt4x1.py", line 552, in on_failure
    raise Neo4jError.hydrate(**metadata)
neo4j.exceptions.TransientError: {code: Neo.TransientError.Transaction.BookmarkTimeout} {message: Database 'top1k' not up to the requested version: 113071. Latest database version is 113054}

Not sure whether this is related to this PR, or whether it's just an artifact of the db being unable to keep up while running analyze over linux with 128 workers.

robobenklein commented 3 years ago

not to be merged until https://github.com/neo4j/neo4j/issues/12686 is solved