googleapis / google-api-nodejs-client

Google's officially supported Node.js client library for accessing Google APIs. Support for authorization and authentication with OAuth 2.0, API Keys and JWT (Service Tokens) is included.
https://googleapis.dev/nodejs/googleapis/latest/
Apache License 2.0

Heap running out-of-memory with bigquery insertLoadJob #1984

Closed 0x80 closed 3 years ago

0x80 commented 4 years ago

In the code below I inject some jobs to import data into BigQuery from a Firestore export file. Over the past few days my database has reached a point where this creates a heap out-of-memory error in my deployed cloud function; 2 GB of memory is no longer enough to execute this code.

I assume there is a memory leak somewhere. I cannot think of a reason why injecting a job that points to a file in a storage bucket would have to use so much memory on the client side.

I guess I could enhance the code by waiting for each job to finish before injecting a new one, but that would be silly: it would likely cause the cloud function to time out, and it clearly doesn't scale.
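For reference, "waiting for each job to finish" would mean polling jobs.get after every insert until BigQuery reports the job state as DONE. A rough sketch of that (it reuses the bigquery client from the code below and assumes the jobId and location are taken from the jobReference returned by jobs.insert):

async function waitForJob(
  projectId: string,
  credential: OAuth2Client,
  jobId: string,
  location: string
) {
  for (;;) {
    // jobs.get returns the current Job resource, including status.state
    // (PENDING, RUNNING or DONE).
    const { data } = await bigquery.jobs.get({
      projectId,
      jobId,
      location,
      auth: credential
    });
    if (data.status && data.status.state === "DONE") {
      return data;
    }
    // Sleep a few seconds between polls; in a cloud function this wait is
    // exactly where the timeout risk comes from.
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}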

Environment details

Steps to reproduce

import {
  GoogleAuth,
  OAuth2Client
} from "google-auth-library";
import { google } from "googleapis";
import {
  getExportCollectionList,
  getLatestFirestoreExportUri
} from "./helpers";
import { isEmpty } from "lodash";

const bigquery = google.bigquery("v2");

const auth = new GoogleAuth({
  scopes: "https://www.googleapis.com/auth/cloud-platform"
});

export async function injectFirestoreExportInBigQuery() {
  const { projectId, credential } = await auth.getApplicationDefault();

  if (!projectId) {
    throw new Error(
      `Missing projectId from auth.getApplicationDefault() response`
    );
  }

  const collectionIds = await getExportCollectionList();
  const exportFolderUri = await getLatestFirestoreExportUri(projectId);

  console.log("Sourcing from export uri", exportFolderUri);

  const failures: Array<{ collectionId: string; text: string }> = [];

  for (const collectionId of collectionIds) {
    console.log("Inject dataset for collection", collectionId);
    const result = await insertLoadJob(
      projectId,
      credential,
      exportFolderUri,
      collectionId
    );
    if (result.status !== 200) {
      console.error(
        new Error(
          `Failed to schedule job for collection ${collectionId}: ${result.statusText}`
        )
      );
      failures.push({ collectionId, text: result.statusText });
    }
  }

  if (isEmpty(failures)) {
    console.log("Successfully scheduled BigQuery import jobs");
  } else {
    console.log("There were some failures:", failures);
  }
}

async function insertLoadJob(
  projectId: string,
  credential: OAuth2Client,
  exportFolderUri: string,
  collectionId: string
) {
  const dataUri = `${exportFolderUri}/all_namespaces/kind_${collectionId}/all_namespaces_kind_${collectionId}.export_metadata`;

  const payload = {
    jobReference: {
      location: "US"
    },
    configuration: {
      load: {
        sourceUris: [dataUri],
        sourceFormat: "DATASTORE_BACKUP",
        writeDisposition: "WRITE_TRUNCATE",
        createDisposition: "CREATE_IF_NEEDED",
        destinationTable: {
          datasetId: "firestore",
          projectId,
          tableId: collectionId
        }
      }
    }
  };

  const result = await bigquery.jobs.insert({
    projectId,
    auth: credential,
    requestBody: payload
  });

  return result;
}
bcoe commented 4 years ago

@schmidt-sebastian anything jumping out at you as weird here? I don't see anything client side that should necessarily be eating more memory as the upstream database grows.

schmidt-sebastian commented 4 years ago

I know very little about BigQuery's network stack, but it looks like they only use the low-level Firestore API which was never as prone to memory leaks as the main SDK. I would expect that the leak is somewhere in BigQuery's network stack, but I don't have any insights.

0x80 commented 4 years ago

Since I created the issue I have tried running the insert jobs sequentially (waiting for each insert, but not for the created job to finish) instead of in parallel. It made no difference. I think the problem is not a memory leak, but the fact that the client somehow reads / validates all of the data being passed into the job.

This seems inefficient to me, so I would still consider this a bug. Importing managed exports into BigQuery by pointing jobs at a bucket should not put this amount of load on the client inserting the jobs, IMO. But I think it's possible that this is by design and not considered a bug.

JustinBeckwith commented 4 years ago

Greetings @0x80! By chance, have you done a heap snapshot analysis? It would be good to know where the objects in your program are coming from. Without seeing the snapshot, it's impossible to know what comes from this module and what comes from other factors in your app.

One thing specifically that jumps out to me is this:

const collectionIds = await getExportCollectionList();

Any ideas on how big that list could get?
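For reference, capturing a heap snapshot from inside the script can be as simple as the sketch below (it assumes Node 12+, which ships v8.writeHeapSnapshot, and a hypothetical "./inject" module exporting the function from the repro):

import { writeHeapSnapshot } from "v8";
// Hypothetical path to the injectFirestoreExportInBigQuery function above.
import { injectFirestoreExportInBigQuery } from "./inject";

async function main() {
  await injectFirestoreExportInBigQuery();
  // Writes a .heapsnapshot file to the working directory and returns its
  // name; open it in Chrome DevTools > Memory to see what retains the heap.
  const file = writeHeapSnapshot();
  console.log("Heap snapshot written to", file);
}

main().catch(console.error);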

arorajatin commented 4 years ago

I had the same issue and realized that this only happens if I use ES6 import statements and doesn't happen if I use require. Is there a cyclic dependency somewhere that causes this OOM issue?
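For clarity, the two ways of loading the library being compared look roughly like this:

// ES module syntax, as in the repro above (transpiled by TypeScript or Babel):
import { google } from "googleapis";

const bigquery = google.bigquery("v2");

// The CommonJS equivalent that reportedly did not hit the OOM:
//   const { google } = require("googleapis");
//   const bigquery = google.bigquery("v2");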

0x80 commented 4 years ago

@JustinBeckwith No, I haven't done an analysis like that. The collections list is never that big. We have about 20 root collections in total so it won't go over that.

steffnay commented 4 years ago

@JustinBeckwith @bcoe do you think that a cyclic dependency might be the cause of this, as mentioned by @arorajatin?

bcoe commented 4 years ago

> @JustinBeckwith @bcoe do you think that a cyclic dependency might be the cause of this, as mentioned by @arorajatin?

I'm not quite sure what magic makes this work:

import {
  GoogleAuth,
  OAuth2Client
} from "google-auth-library";

☝️ I don't believe import statements were introduced until Node 12. @0x80 are you running a build step and writing your application using Babel or TypeScript?

Another thing worth digging into:

How large is the set returned by getExportCollectionList()? If it has grown into millions of entries, that seems like the most likely place for a memory leak.

0x80 commented 4 years ago

@bcoe My scripts are written in TypeScript and executed with ts-node.

The getExportCollectionList function just returns the list of collection names that need to be exported, based on a blacklist. The returned value is a list of around 20 strings, so that can't be it.

Here's the implementation:

export function getExportCollectionList() {
  // `db` is assumed to be the project's Firestore instance (firebase-admin).
  const collectionsToExclude = ["__system", "emails"];

  return db.listCollections().then((collectionRefs) => {
    return collectionRefs
      .map((ref) => ref.id)
      .filter((id) => !collectionsToExclude.includes(id));
  });
}
bcoe commented 4 years ago

@0x80 just to eliminate something from the equation, could you use tsc and compile the script to JavaScript?

meredithslota commented 3 years ago

Closing due to lack of activity, but feel free to reopen with more repro information if the above steps don't help. Thank you!

sourabh-4097 commented 2 years ago

Facing an "Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory" error when googleapis is instantiated in code.
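A quick way to see how much of that heap is attributable to instantiating the client itself is to log process.memoryUsage() around it; a small sketch:

// Sketch: log heap usage after loading googleapis and again after
// instantiating an API client.
import { google } from "googleapis";

function logHeap(label: string) {
  const usedMb = process.memoryUsage().heapUsed / 1024 / 1024;
  console.log(`${label}: ${usedMb.toFixed(1)} MB heap used`);
}

logHeap("after importing googleapis");

const bigquery = google.bigquery("v2");
logHeap('after google.bigquery("v2")');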