ramses-tech / nefertari

Nefertari is a REST API framework sitting on top of Pyramid and ElasticSearch
Apache License 2.0

nefertari.index and re-indexation #148

Open Ezka77 opened 8 years ago

Ezka77 commented 8 years ago

Hi, I'm stuck on an issue with the nefertari.index command: --recreate seems to work, but it indexes only 10k rows.

I've tried playing with the --chunk option, but it doesn't change anything (see the example below).

I've also tried re-running the command with --models on a table of ~12k rows, but it reports that no documents are missing:

# nefertari.index -c docker.ini --models Label
2016-10-28 15:39:49,283 - nefertari.elasticsearch - Including Elasticsearch. {'chunk_size': 500, 'index.disable': 'false', 'enable_aggregations': 'false', 'enable_refresh_query': 'false', 'enable_polymorphic_query': 'false', 'sniff': 'false', 'hosts': 'elasticsearch:9200', 'index_name': 'hathor'}
2016-10-28 15:39:49,283 - root - Indexing models documents
2016-10-28 15:39:49,283 - root - Processing model `Label`
2016-10-28 15:39:49,940 - root - Indexing missing `Label` documents
2016-10-28 15:39:49,940 - nefertari.elasticsearch - Trying to index documents of type `Label` missing from `hathor` index
2016-10-28 15:39:50,117 - elasticsearch - GET http://elasticsearch:9200/hathor/Label/_mget?fields=_id [status:200 request:0.173s]
2016-10-28 15:39:50,152 - nefertari.elasticsearch - No documents of type `Label` are missing from index `hathor`

The same command with --chunk 1000 doesn't change anything:

# nefertari.index -c docker.ini --models Label --chunk 1000
2016-10-28 15:41:18,802 - nefertari.elasticsearch - Including Elasticsearch. {'index_name': 'hathor', 'sniff': 'false', 'hosts': 'elasticsearch:9200', 'enable_refresh_query': 'false', 'enable_polymorphic_query': 'false', 'enable_aggregations': 'false', 'index.disable': 'false', 'chunk_size': 500}
2016-10-28 15:41:18,803 - root - Indexing models documents
2016-10-28 15:41:18,803 - root - Processing model `Label`
2016-10-28 15:41:19,453 - root - Indexing missing `Label` documents
2016-10-28 15:41:19,453 - nefertari.elasticsearch - Trying to index documents of type `Label` missing from `hathor` index
2016-10-28 15:41:19,618 - elasticsearch - GET http://elasticsearch:9200/hathor/Label/_mget?fields=_id [status:200 request:0.161s]
2016-10-28 15:41:19,654 - nefertari.elasticsearch - No documents of type `Label` are missing from index `hathor`

In the end, with this example, I have in Postgres:

select count(*) from label;
 count
-------
 12371
(1 row)

And from my API:

http http://myawesomeserver/api/Labels
{
start: 0,
fields: "",
data: [...],
count: 20,
took: 2,
total: 10000
}

Which is not consistent =s

NB: A freeze of my env:

attrs==16.2.0
blinker==1.4
click==6.6
cryptacular==1.4.1
elasticsearch==1.7.0
hathor==0.0.1
inflection==0.3.1
jsonref==0.1
markdown2==2.3.1
nefertari==0.7.0
nefertari-sqla==0.4.2
Paste==2.0.2
PasteDeploy==1.5.2
pbkdf2==1.3
psycopg2==2.6.2
pyramid==1.5.7
pyramid-sqlalchemy==1.6
pyramid-tm==1.0.1
PyYAML==3.12
ramlfications==0.1.8
ramses==0.5.3
repoze.lru==0.6
requests==2.11.1
simplejson==3.10.0
six==1.10.0
SQLAlchemy==1.1.3
SQLAlchemy-Utils==0.32.9
Tempita==0.5.2
termcolor==1.1.0
transaction==1.6.1
translationstring==1.3
urllib3==1.18.1
venusian==1.0
waitress==0.8.9
WebOb==1.6.2
xmltodict==0.10.2
zope.deprecation==4.1.2
zope.dottedname==4.1.0
zope.interface==4.3.2
zope.sqlalchemy==0.7.7

postatum commented 8 years ago

Hi @Ezka77.

Try running nefertari.index -c docker.ini --models Label --recreate --params=_limit=NUM where NUM is a number greater than the number of Label items in your db.

Ezka77 commented 8 years ago

Hi @postatum

Hmm, --params= seems a bit undocumented, but now I remember I used it the last time this happened. OK, I see the idea, but the command ends with this error: nefertari.index: error: argument --recreate: not allowed with argument --models

If I remember correctly, --recreate replaced the --force option, but it seems to do a bit more.

So I ran the command without --recreate, which should be fine since only ~3k rows are missing, but here is the traceback:

2016-11-02 09:40:01,906 - elasticsearch - GET http://elasticsearch:9200/hathor/Label/_mget?fields=_id [status:200 request:0.588s]
Traceback (most recent call last):
  File "/usr/local/bin/nefertari.index", line 9, in <module>
    load_entry_point('nefertari==0.7.0', 'console_scripts', 'nefertari.index')()
  File "/usr/local/lib/python3.5/site-packages/nefertari/scripts/es.py", line 23, in main
    return command.run()
  File "/usr/local/lib/python3.5/site-packages/nefertari/scripts/es.py", line 123, in run
    self.index_models(model_names)
  File "/usr/local/lib/python3.5/site-packages/nefertari/scripts/es.py", line 103, in index_models
    es.index_missing_documents(documents)
  File "/usr/local/lib/python3.5/site-packages/nefertari/elasticsearch.py", line 358, in index_missing_documents
    self._bulk('index', documents, request)
  File "/usr/local/lib/python3.5/site-packages/nefertari/elasticsearch.py", line 318, in _bulk
    operation=operation)
  File "/usr/local/lib/python3.5/site-packages/nefertari/elasticsearch.py", line 269, in process_chunks
    if count < chunk_size:
TypeError: unorderable types: int() < str()

I guess a string-to-int conversion is missing somewhere. Found a real bug this time =).
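For illustration: the traceback above is what Python 3 raises when an `int` is compared with a `str`. Below is a minimal sketch of the kind of coercion that would avoid it, assuming the chunk size reaches the helper as a string from the command line or ini file. `process_chunks_sketch` is a hypothetical stand-in, not nefertari's actual `process_chunks`.

```python
def process_chunks_sketch(documents, operation, chunk_size):
    """Apply `operation` to successive slices of `documents`.

    Hypothetical stand-in for illustration; not nefertari's implementation.
    """
    # Values parsed from a CLI flag or an ini file are often strings,
    # so coerce once before any numeric comparison or slicing.
    chunk_size = int(chunk_size)
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")

    results = []
    for start in range(0, len(documents), chunk_size):
        results.append(operation(documents[start:start + chunk_size]))
    return results


# Comparing an int with a str directly raises TypeError on Python 3.
try:
    3 < "500"
    comparison_failed = False
except TypeError:
    comparison_failed = True

# With the coercion in place, a string chunk size is accepted.
chunk_lengths = process_chunks_sketch(list(range(5)), len, "2")
```

Coercing at the point where the option is read, rather than at each comparison, keeps the rest of the code free of type checks.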

Well, with luck str() < str() should be OK. I ran this one: nefertari.index -c docker.ini --models Label --params=_limit=13000 --chunk=500 and it worked.

Last one: I have a table with ~3,200,000 rows and my little server has only 15 GB of RAM. Doing some simple math: 2 GB for Postgres, 2 GB for ES (a dump of Postgres is about 2 GB), so 10 GB for the re-indexing process seems fair? Well, not at all! I'll try to work around the lack of RAM with some more swap, but I'm afraid of how long the job will take.

I hit a timeout error above 1 million rows. Is there a way to increase the timeout limit?

May I suggest finding a way to avoid such RAM consumption? =D I know my tables are not well optimized, but maybe there is a way to process each chunk/offset/page and release some RAM in between? Again, I know it's hard to write code that stays "model-agnostic"; maybe you just can't.
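The chunk/offset/page idea can be sketched as a generator that holds only one page in memory at a time. This is an assumption-laden illustration, not nefertari code: `fetch_page(offset, limit)` and `index_page(rows)` are hypothetical callables standing in for a database query and a bulk-index call.

```python
def iter_pages(fetch_page, page_size):
    """Yield lists of at most `page_size` rows until the source is exhausted.

    `fetch_page(offset, limit)` is a hypothetical callable that returns a list.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        yield page
        if len(page) < page_size:
            break  # a short page means there is nothing left to fetch
        offset += len(page)


def index_in_pages(fetch_page, index_page, page_size=500):
    """Index page by page so only one page is referenced at any time."""
    total = 0
    for page in iter_pages(fetch_page, page_size):
        index_page(page)
        total += len(page)
    return total


# Usage with an in-memory list standing in for the table.
rows = list(range(1234))
indexed = []
page_sizes = []


def fake_index(page):
    page_sizes.append(len(page))
    indexed.extend(page)


total = index_in_pages(lambda offset, limit: rows[offset:offset + limit],
                       fake_index, page_size=500)
```

Offset paging needs a stable ordering (for example by primary key) to avoid skipping or repeating rows. With an ORM, memory is only freed if the session also drops its references to the loaded objects, for instance with SQLAlchemy's `Session.expunge_all()` between pages.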

Ezka77 commented 8 years ago

Hi,

I wrote a quick & dirty script to index all Postgres data in ES without running out of memory, no matter the table size. It is mostly inspired by the nefertari index script; the code is linked below. It should work no matter your table design, as it goes through the nefertari abstraction (NB: I'm using ramses too; not tested without it).

It features a "delete mapping" directive: removing a mapping deletes all of its documents from the index.

I rely on ES for the indexing process: if I push an already known document, nothing should happen (correct me if I'm wrong), and it's really fast on ES.
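For reference on the "nothing should happen" assumption: an Elasticsearch `index` operation on an existing `_id` replaces the stored document and increments its `_version` rather than being skipped. Re-pushing unchanged rows therefore leaves the same end state, but it is still a write. A minimal sketch of building such actions in the dict format accepted by `elasticsearch.helpers.bulk` follows; the row shape, `id_field`, and `to_bulk_actions` are assumptions for illustration and are not taken from the linked script.

```python
def to_bulk_actions(rows, index_name, doc_type, id_field="id"):
    """Yield one bulk `index` action per row (rows are assumed to be dicts).

    Using the database primary key as `_id` means a re-run overwrites the
    existing document instead of creating a duplicate.
    """
    for row in rows:
        yield {
            "_op_type": "index",   # "create" would instead fail on existing ids
            "_index": index_name,
            "_type": doc_type,     # document types exist in the ES 1.x line used here
            "_id": row[id_field],
            "_source": row,
        }


sample_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
actions = list(to_bulk_actions(sample_rows, "hathor", "Label"))
```

The generator can be passed to `elasticsearch.helpers.bulk(client, actions)`, which sends the actions in chunks, so it combines naturally with page-by-page reading from the database.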

Code here: https://github.com/Ezka77/nefertari-manage-index/blob/master/manage_index.py