jeroen / mongolite

Fast and Simple MongoDB Client for R
https://jeroen.github.io/mongolite/

count() taking long time #153

Open EmilBode opened 6 years ago

EmilBode commented 6 years ago

I think this started when the new count API was implemented: count() now takes a long time to complete. My collection has 1739040 documents with an average size of about 15 kB, and runs in a Docker setup. In the mongo shell, db.collection.count() returns right away, but through the mongolite interface it takes a few minutes. It looks like a full query of some kind is executed instead of using stored metadata. I guess the new setup is more resistant to errors, but I often use count() simply as a quick check that I have a working connection to the right database, and for that I'd like a quick result. Is this a bug, or could we get the old count back?

Details of setup:

When the process is busy, db.currentOp() gives the following output:

{
    "inprog" : [
        {
            "host" : "2351dfabc6d4:27017",
            "desc" : "conn7",
            "connectionId" : 7,
            "client" : "127.0.0.1:45744",
            "appName" : "MongoDB Shell",
            "clientMetadata" : {
                "application" : {
                    "name" : "MongoDB Shell"
                },
                "driver" : {
                    "name" : "MongoDB Internal Client",
                    "version" : "3.6.5"
                },
                "os" : {
                    "type" : "Linux",
                    "name" : "PRETTY_NAME=\"Debian GNU/Linux 8 (jessie)\"",
                    "architecture" : "x86_64",
                    "version" : "Kernel 4.9.93-linuxkit-aufs"
                }
            },
            "active" : true,
            "currentOpTime" : "2018-09-10T09:03:20.710+0000",
            "opid" : 1596,
            "secs_running" : NumberLong(0),
            "microsecs_running" : NumberLong(84),
            "op" : "command",
            "ns" : "admin.$cmd.aggregate",
            "command" : {
                "currentOp" : 1,
                "$db" : "admin"
            },
            "numYields" : 0,
            "locks" : {

            },
            "waitingForLock" : false,
            "lockStats" : {

            }
        },
        {
            "host" : "2351dfabc6d4:27017",
            "desc" : "conn5",
            "connectionId" : 5,
            "client" : "172.17.0.1:37970",
            "appName" : "r/mongolite",
            "clientMetadata" : {
                "application" : {
                    "name" : "r/mongolite"
                },
                "driver" : {
                    "name" : "mongoc",
                    "version" : "1.12.0"
                },
                "os" : {
                    "type" : "Darwin",
                    "name" : "macOS",
                    "version" : "17.7.0",
                    "architecture" : "x86_64"
                },
                "platform" : "cfg=0x0216a8e9 posix=200112 stdc=201112 CC=clang 6.0.0 (tags/RELEASE_600/final) CFLAGS=\"\" LDFLAGS=\"\""
            },
            "active" : true,
            "currentOpTime" : "2018-09-10T09:03:20.710+0000",
            "opid" : 1509,
            "lsid" : {
                "id" : UUID("225e3d99-4ac5-4a1d-9488-9d8d02d5c820"),
                "uid" : BinData(0,"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=")
            },
            "secs_running" : NumberLong(59),
            "microsecs_running" : NumberLong(59064213),
            "op" : "command",
            "ns" : "MyDB.MyCol",
            "command" : {
                "aggregate" : "MyCol",
                "cursor" : {

                },
                "pipeline" : [
                    {
                        "$group" : {
                            "_id" : null,
                            "n" : {
                                "$sum" : 1
                            }
                        }
                    }
                ],
                "$db" : "MyDB",
                "$readPreference" : {
                    "mode" : "primaryPreferred"
                },
                "lsid" : {
                    "id" : UUID("225e3d99-4ac5-4a1d-9488-9d8d02d5c820")
                }
            },
            "planSummary" : "COLLSCAN",
            "numYields" : 5077,
            "locks" : {
                "Global" : "r",
                "Database" : "r",
                "Collection" : "r"
            },
            "waitingForLock" : false,
            "lockStats" : {
                "Global" : {
                    "acquireCount" : {
                        "r" : NumberLong(10158)
                    }
                },
                "Database" : {
                    "acquireCount" : {
                        "r" : NumberLong(5079)
                    }
                },
                "Collection" : {
                    "acquireCount" : {
                        "r" : NumberLong(5079)
                    }
                }
            }
        }
    ],
    "ok" : 1
}
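
The command field of the r/mongolite entry in the output above shows an aggregate with a single $group stage, and planSummary reports COLLSCAN, so the server reads every document to produce the number. As a minimal sketch, the same pipeline can be run by hand through the connection's aggregate() method to reproduce the behaviour. The helper name exact_count is hypothetical, and con is assumed to be a connection created with mongolite::mongo().

```r
# Sketch only: `con` is assumed to be a connection created with
# mongolite::mongo(); `exact_count` is a hypothetical helper name.

# Runs the $group pipeline visible in the currentOp output. Without a filter
# or a usable index the server has to walk the whole collection (COLLSCAN),
# which is why this is slow on large collections.
exact_count <- function(con) {
  res <- con$aggregate('[{"$group": {"_id": null, "n": {"$sum": 1}}}]')
  # An empty collection produces no groups, so the result has no rows.
  if (nrow(res) == 0) 0 else res$n[[1]]
}
```

If exact_count(con) takes about as long as con$count() on the same collection, that is consistent with count() being implemented as this aggregation rather than as a metadata lookup.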

epklein commented 6 years ago

Same problem here. A "count" operation that was working properly stopped working after updating to R 3.5.1 and mongolite 2.0.

library(mongolite)
m <- mongo(collection = "RealizedRouteCollection", db = mongoDB, url = mongoUrl)
m$count()
Error: Failed to send "aggregate" command with database "MongoDBDatabase": socket error or timeout

It was working properly before the updates. While the error persists, I've found a workaround: getting the count from the collection info with the following code:

info <- m$info()
info$stats$count

This returns instantly.
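
To check on a given collection how much the two approaches differ, a small timing helper along these lines may be useful. This is only a sketch: compare_counts is a hypothetical name, con is assumed to be a connection created with mongolite::mongo(), and the statistics returned by info() are assumed to contain a count field as in the workaround above.

```r
# Sketch only: `con` is assumed to be a connection created with
# mongolite::mongo(); `compare_counts` is a hypothetical helper name.
compare_counts <- function(con) {
  # Exact count through count(), which is the slow path in this report.
  t_exact <- system.time(n_exact <- con$count())[["elapsed"]]

  # Count taken from the collection statistics returned by info().
  t_meta <- system.time(n_meta <- con$info()$stats$count)[["elapsed"]]
  # Statistics may be missing; keep the row and mark the value as NA.
  if (is.null(n_meta)) n_meta <- NA_real_

  data.frame(
    method  = c("count()", "info()$stats$count"),
    n       = c(n_exact, n_meta),
    seconds = c(t_exact, t_meta)
  )
}
```

Calling compare_counts(m) on the connection from the example above returns a two-row data frame with the count and the elapsed time for each method. Note that the first row still triggers the slow count, so on a very large collection it can run into the same timeout.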

EmilBode commented 6 years ago

Thanks for the workaround, I overlooked info()

jeroen commented 6 years ago

In the latest version of the C driver, the old mongoc_collection_count API was deprecated, and there are now two APIs to perform a count: mongoc_collection_count_documents and mongoc_collection_estimated_document_count.

I am using the first one, but perhaps there should be an option to use the estimated count instead?

@ajdavis is it expected that mongoc_collection_count_documents is slow?

ajdavis commented 6 years ago

Good catch. Yes, it is expected that count_documents is slower than estimated_document_count.

https://github.com/mongodb/specifications/blob/55bb56bc9da4cb13d23380f1ca2dfc6dd93a845c/source/crud/crud.rst#count-api-details

jeroen commented 6 years ago

OK so I'll add an option to get the estimated count (I guess that's what users are expecting here).

ajdavis commented 6 years ago

I recommend you follow the same path as libmongoc and other MongoDB drivers: deprecate or remove "count", add "estimatedDocumentCount" and "countDocuments".
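
Until such methods exist in mongolite itself, the proposed split can be approximated in user code. The following is a rough sketch under stated assumptions, not mongolite's API: both function names are hypothetical and only mirror the naming used by other drivers, con is assumed to be a connection created with mongolite::mongo(), and the estimate is read from info() as in the workaround earlier in this thread.

```r
# Sketch only, not part of mongolite: `con` is assumed to be a connection
# created with mongolite::mongo(). Both function names are hypothetical.

# Exact count of the documents matching `query` (a JSON string). This goes
# through con$count(), so it may scan the collection and can be slow.
count_documents <- function(con, query = "{}") {
  con$count(query)
}

# Approximate count of the whole collection, read from the collection
# statistics. It is fast but cannot take a filter.
estimated_document_count <- function(con) {
  n <- con$info()$stats$count
  if (is.null(n)) {
    stop("collection statistics are not available for this connection")
  }
  n
}
```

The estimate is suitable for quick sanity checks such as the one described in the original report. When the number has to be exact, or a filter is needed, count_documents() remains the right call despite its cost.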

vorachet commented 5 years ago

Hello, do you have any updates on this issue? I have the same problem.

pfv07 commented 1 year ago

Hi, I am working with transactional collections that have more than 10 million documents, so an estimated count would be very useful. Jeroen, are you going to add this option? Thanks, mongolite is great for R.