ravendb / ravendb-python-client

This is the official Python client for the RavenDB document database
MIT License

How to start up in python #137

Closed ethanperrine closed 2 years ago

ethanperrine commented 2 years ago

I'm playing around with NoSQL databases. How would I query for something and print the results?

ayende commented 2 years ago

Something like this:

store = DocumentStore(urls, db_name)
store.initialize()

with store.open_session() as session:
    p = session.load("products/1-A")
    print(p)
ethanperrine commented 2 years ago

Hey Thank you! However I am getting this error: "pyravendb.custom_exceptions.exceptions.InvalidOperationException: The maximum number of requests (30) allowed for this session has been reached. Raven limits the number of remote calls that a session is allowed to make as an early warning system. Sessions are expected to be short lived, and Raven provides facilities like batch saves (call save_changes() only once). You can increase the limit by setting DocumentConventions. MaxNumberOfRequestsPerSession or MaxNumberOfRequestsPerSession, but it is advisable that you'll look into reducing the number of remote calls first, since that will speed up your application significantly and result in a more responsive application."

ayende commented 2 years ago

Yes, that is by design. The issue is that you are expected to use the session for a short duration (such as a single request).

Can you explain what you are doing? Also note that you can set the limit yourself.

ethanperrine commented 2 years ago

Basically I'm inputting a file, checking whether each line of the file is in the database or not, and returning the result.
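The membership check described here is mostly string handling; a minimal sketch of the per-line parsing, assuming lines are either `email:hash` or a bare hash (`parse_line` is a hypothetical name, not part of pyravendb):

```python
def parse_line(line: str) -> str:
    """Extract the hash from a line shaped like 'email:hash' or a bare hash.

    Hypothetical helper; the exact line format is an assumption based on
    the code posted later in this thread.
    """
    if ":" in line:
        # split(":", 1) keeps any colons inside the hash portion intact
        return line.split(":", 1)[1].strip()
    return line.strip()
```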

ethanperrine commented 2 years ago
import logging
import sys
from typing import Any
import easygui
from pyravendb.store import document_store

logging.basicConfig(level=logging.ERROR, filename="Testing.log", format="%(levelname)s:%(message)s")

# write code that opens a new window with easygui and asks for a file
f = open(easygui.fileopenbox("Import your data here")).readlines()

# write code to query in raven DB
def print_err(err: str) -> None:
    print("Check logs!")
    logging.error(err)
    sys.exit(0)

def create_session() -> Any:
    """ Creates a RavenDB session, but for testing, also queries"""

    store = document_store.DocumentStore(urls=['http://127.0.0.1:8080'], database='Testing')
    store.conventions.max_number_of_requests_per_session = 1000
    store.initialize()
    with store.open_session() as session:
        for line in f:
            if ":" in line:
                parts = line.split(":", 1)
                fillerr = parts[0].strip()
                foohash = parts[1].strip()
            else:
                fillerr = foohash = line.strip()

            query = session.query().where(hash=foohash)
            for doc in query:
                try:
                    print(f"{fillerr}:{doc.plaintext}")
                except Exception as err:
                    print(f"{err}")
                    print_err(f"In opening a session -> {err}")

create_session()
ayende commented 2 years ago

Okay, so the issue is that you are querying once for each line, in the same session. You can probably batch this, but the simplest option is to do:


for line in f:
      with store.open_session() as session:
ethanperrine commented 2 years ago

What would be a more efficient way of doing this on server resources?

ethanperrine commented 2 years ago

also looking through billions of lines takes a while, any possible way to speed up the query?

ayende commented 2 years ago

You are issuing a query per line, and that would take a while. A better option is to get 1024 lines and query them all at once, like so:

for lines_batch in batch(f, 1024):
    hashes = get_hashes_from_batch(lines_batch)
    with store.open_session() as session:
        query = session.query().where_in(hash=hashes)
...
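`batch` and `get_hashes_from_batch` are placeholder names in the snippet above; a minimal sketch of both, assuming the `email:hash` / bare-hash line format from the posted code:

```python
from itertools import islice

def batch(iterable, n=1024):
    """Yield successive lists of up to n items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk

def get_hashes_from_batch(lines_batch):
    """Strip each line down to its hash ('email:hash' or bare hash)."""
    return [line.split(":", 1)[1].strip() if ":" in line else line.strip()
            for line in lines_batch]
```

Unlike a `len()`/slice based version, this works on any iterable, including a file object streamed line by line instead of loaded via `readlines()`.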
ethanperrine commented 2 years ago

So I understand RavenDB uses a B-tree for efficient indexing. Should I import the files pre-sorted, or will RavenDB sort the data itself?

ethanperrine commented 2 years ago

Also, I tried adding batching to the code, but it only returns one result.

ayende commented 2 years ago

RavenDB actually uses several different methods internally. Working with sorted data will make some things easier, yes, but it isn't mandatory.

As for getting 1 output, you should be getting a list with all the matches for the query.

ethanperrine commented 2 years ago

I'm assuming I wrote this correctly.

import logging
import sys
from typing import Any
import easygui
from pyravendb.store import document_store

def batch(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

# subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'batch'])
logging.basicConfig(level=logging.ERROR, filename="Testing.log", format="%(levelname)s:%(message)s")

# write code that opens a new window with easygui and asks for a file
f = open(easygui.fileopenbox("Import your data here")).readlines()

fw = open("dehashed.txt", "w")

def print_err(err: str) -> None:
    print("Check logs!")
    logging.error(err)
    sys.exit(0)

def create_session() -> Any:
    """ Creates a RavenDB session, but for testing, also queries"""

    store = document_store.DocumentStore(urls=['http://127.0.0.1:8080'], database='Testing')
    store.conventions.max_number_of_requests_per_session = 100000000
    store.initialize()
    # write a way to query 1024 lines batch at a time
    for lines_batch in batch(f, 3):
        # print(lines_batch)
        # hashes = get_hashes_from_batch(lines_batch)
        with store.open_session() as session:
            for line in lines_batch:
                if ":" in line:
                    parts = line.split(":", 1)
                    email = parts[0].strip()
                    foohash = parts[1].strip()
                else:
                    email = foohash = line.strip()

                query = session.query().where(hash=foohash)
                for doc in query:
                    try:
                        print(f"{email}:{doc.plaintext}")
                        fw.write(f"{email}:{doc.plaintext}\n")
                    except Exception as err:
                        print_err(f"In opening a session -> {err}")

# fw.close()

create_session()
ayende commented 2 years ago

Your batches are very small (3 items); I would recommend going to 1000. foohash is a string, no? You need to use a where_in and pass a list there.

hashes = []
for line in lines_batch:
    if ":" in line:
        line = line.split(":")
        foohash = line[1].strip()
    else:
        foohash = line.strip()
    hashes.append(foohash)
query = session.query().where_in(hash=hashes)
ethanperrine commented 2 years ago
    for lines_batch in batch(f, 30):
        # print(lines_batch)
        # hashes = get_hashes_from_batch(lines_batch)
        with store.open_session() as session:
            hashes = []
            for line in lines_batch:
                if ":" in line:
                    line = line.split(":")
                    foohash = line[1].strip()
                else:
                    foohash = line.strip()

                hashes.append(foohash)
                query = session.query().where_in(hash=hashes)
                email = line[0].strip()
                for doc in query:
                    data = str(doc.plaintext)
                    try:
                        print(f"{email}:" + f"{data}")
                        fw.write(f"{email}:" + f"{data}\n")
                    except Exception as err:
                        # print(f"{err}")
                        print_err(f"In opening a session -> {err}")

So this is the code now; however, I'm getting:

query = session.query().where_in(hash=hashes)
TypeError: Query.where_in() got an unexpected keyword argument 'hash'

hash is already defined in the database. What could be wrong?

ayende commented 2 years ago

Sorry, needs to be:

query = session.query().where_in("hash", hashes)
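Putting the fix in context, a minimal sketch of the batched lookup: the hash extraction is plain Python, while the session part is commented out because it needs a running RavenDB server. The sample lines and the `plaintext` field mirror the code posted above.

```python
# Sample 'email:hash' / bare-hash lines (illustrative data only)
lines = ["a@b.c:deadbeef\n", "cafebabe\n"]

# Collect the hash portion of every line in the batch up front
hashes = [l.split(":", 1)[1].strip() if ":" in l else l.strip() for l in lines]

# With a live store, each batch then becomes a single query, the field
# name passed positionally rather than as a keyword:
# with store.open_session() as session:
#     for doc in session.query().where_in("hash", hashes):
#         print(doc.plaintext)
```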