multinet-app / multinet-server

Multinet server application
https://multinet-app.readthedocs.io
Apache License 2.0

Avoid coercing iterators to list #408

Closed: waxlamp closed this issue 2 years ago

waxlamp commented 4 years ago

See here for an example.

The fear is that coercing a cursor to a list may blow up memory for a query with a very large result set.
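A minimal sketch of the difference (the function names here are illustrative, not the actual multinet code): coercing a cursor to a list materializes every row at once, while iterating lazily holds only one row at a time.

```python
def rows_as_list(cursor):
    # Materializes the whole result set: O(n) memory.
    return list(cursor)


def stream_rows(cursor):
    # Yields one row at a time: O(1) memory beyond the current row.
    for row in cursor:
        yield row


# A generator standing in for a large query cursor:
cursor = ({"key": i} for i in range(1_000_000))
first = next(stream_rows(cursor))  # only one row is ever held in memory
```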

mvandenburgh commented 4 years ago

The cursor object returned by the database query is run through a couple of validation functions before its contents are inserted into the database, so the cursor will already be exhausted (empty) after the first validation pass, as @AlmightyYakob mentioned in https://github.com/multinet-app/multinet-server/pull/404#discussion_r446146104.
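The exhaustion problem is generic Python iterator behavior, not anything multinet-specific. A quick demonstration, including `itertools.tee` as one possible workaround (with the caveat that `tee` buffers every row one copy has seen and the other has not, so it does not save memory if one pass runs to completion first):

```python
import itertools

# A generator standing in for the database cursor (illustrative only).
cursor = (row for row in [{"_key": "a"}, {"_key": "b"}])

validated = sum(1 for _ in cursor)  # a first "validation" pass drains it...
leftover = list(cursor)             # ...so nothing remains to insert

# itertools.tee yields independent iterators over the same stream.
cursor2 = (row for row in [{"_key": "a"}, {"_key": "b"}])
for_validation, for_insert = itertools.tee(cursor2)
count = sum(1 for _ in for_validation)  # buffers rows for the other copy
rows = list(for_insert)                 # still sees every row
```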

One solution I thought of is to validate and insert each row individually, but that would require rewriting all the validation functions. Is that adding too much complexity to this issue?

jjnesbitt commented 4 years ago

> One solution I thought of is to validate and insert each row individually, but that would require rewriting all the validation functions. Is that adding too much complexity to this issue?

We've already bounced this idea around in our heads as something we might want to try, so if you wanted to attempt to tackle that issue, I think that'd provide substantial value to the project. I'd say technically that's larger in scope than this issue is concerned about, but would probably solve this issue in the process, so I'm okay with it.

Any thoughts @waxlamp?

waxlamp commented 4 years ago

> One solution I thought of is to validate and insert each row individually, but that would require rewriting all the validation functions. Is that adding too much complexity to this issue?

I believe inserting each row individually is a non-starter, for a couple of reasons:

  1. It will be way too slow in general. We already want to explore an even faster method of "bulk insertion" to speed up our database update process (see multinet-app/multinet-api#114).
  2. Besides slowness, validation generally has to be done globally across a dataset. As each row hits the validator, it adds to a global record of things such as unique keys seen so far. So we need some way to stream the whole dataset past the validator before we can stream it past the bulk loader. I have some ideas for this (see ).
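The two-pass streaming idea could be sketched roughly as follows. Everything here is hypothetical (the function names, the duplicate-key check, and the `insert_batch` callback are illustrative assumptions, not multinet's actual API): pass 1 streams every row past a validator that accumulates global state such as the set of keys seen, and pass 2 streams rows to a bulk loader in fixed-size batches, without ever building one big list.

```python
from itertools import islice


def validate(rows):
    # Pass 1: global validation, accumulating cross-row state.
    seen = set()
    errors = []
    for row in rows:
        key = row.get("_key")
        if key in seen:
            errors.append(f"duplicate key: {key}")
        seen.add(key)
    return errors


def bulk_load(rows, insert_batch, batch_size=1000):
    # Pass 2: feed the loader fixed-size batches, never the whole list.
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        insert_batch(batch)


data = [{"_key": "a"}, {"_key": "b"}, {"_key": "a"}]
errors = validate(iter(data))  # pass 1 over one iterator
inserted = []
if not errors:
    bulk_load(iter(data), inserted.extend)  # pass 2 over a fresh iterator
```

Note that pass 2 needs a fresh iterator, so in practice this means either re-running the query or spooling rows to temporary storage during pass 1.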

> We've already bounced this idea around in our heads as something we might want to try, so if you wanted to attempt to tackle that issue, I think that'd provide substantial value to the project. I'd say technically that's larger in scope than this issue is concerned about, but would probably solve this issue in the process, so I'm okay with it.
>
> Any thoughts @waxlamp?

I agree that thinking about this doesn't "add too much complexity" as Mike put it. However, it might make sense to open a PR for any of the low-hanging fruit cases (if there are any) and open a new issue to address the validation iterator problem specifically.