kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

blockData – Error: Vector memory exhausted (limit reached?) #72

Open itsmevictor opened 1 year ago

itsmevictor commented 1 year ago

For context, I'm using a MacBook Pro with 32GB RAM and 512GB SSD.

My general goal is to run fastLink to link one dataframe with ~2,000 rows to another that has ~45M rows. Since this will be a pretty lengthy operation, I have chosen to first get it working on a subset of the 45M-row dataframe, and then generalize once everything works. I have therefore taken a 3M-row sample of that dataframe. Since the variables gender (character class) and birth date (Date class) match exactly between the two dataframes, I have chosen to first block on these variables, and then run a loop that applies fastLink to each block.

My code for the blocking step is the following:

block_out <- blockData(dfA = all_candidats_results,
                       dfB = sample_listes,
                       varnames = c("sexe", "date_naissance"),
                       n.cores = 8)

However, I get the following error & warnings:

Error: vector memory exhausted (limit reached?)
In addition: Warning message:
In asMethod(object) :
  sparse->dense coercion: allocating vector of size 41.7 GiB

I understand that this is a memory issue, but I am a bit surprised, since the operation should not be that intensive. Can anyone help? I am also very open to any comments or suggestions on how best to use fastLink to link my two dataframes. Many thanks in advance, and thanks to the developers!

aalexandersson commented 1 year ago

I routinely use fastLink for similarly sized linkages (that is, thousands of rows against 3-4 million records) on a similarly specced desktop without any issues.

Possible data cleaning (attribute alignment) issue: Is the birth date variable formatted the same way in both datasets? The date variable could be of character class, which is sometimes easier to work with than Date class if you are only blocking on the variable.

Possible blocking issue: How many blocks do you have after the blocking? How many rows are in the smallest block? How many rows are in the largest block? (A small sketch for checking this is included at the end of this comment.) I routinely and successfully use 2-6 rather evenly sized blocks for linkages of this size. In your case, I suspect the date variable has too many possible values to be useful for blocking. I assume you will need to use much larger blocks than date of birth, for example a birth-year window (<= 5 blocks).

Possible linkage issue: Do the error and warning come after the blocking or later, that is, after the linkage? If after the linkage, which R/fastLink code did you use for the linkage?
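
For checking the block counts and sizes, here is a minimal sketch, assuming the blockData() output is a list of block.1, block.2, ... elements that each hold dfA.inds and dfB.inds (the structure shown in the fastLink README), and that block_out is the object from your earlier call:

## Number of blocks and the smallest/largest block size on each side.
n_blocks <- length(block_out)
sizes_A  <- sapply(block_out, function(b) length(b$dfA.inds))
sizes_B  <- sapply(block_out, function(b) length(b$dfB.inds))

n_blocks
range(sizes_A)  # smallest and largest block in dfA
range(sizes_B)  # smallest and largest block in dfB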

tedenamorado commented 1 year ago

Hi,

To construct your blocks, you are performing exact matching on sexe and date_naissance. The problem is that, by loading the larger dataset in full, you are exhausting a lot of memory.

One possible solution is to create subsets by sexe and save the resulting objects independently. Then try to make blocks by date_naissance within each subgroup of sexe. By loading the subsets by sexe one at a time, you would allocate memory more efficiently.
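
A minimal sketch of that idea, assuming dplyr, the object and column names used elsewhere in this thread (all_listes, all_candidats_results, sexe, date_naissance), and placeholder file names and sexe codes:

library(dplyr)

## Split the large dataset by sexe and save each subset to disk,
## so only one subset has to be held in memory at a time.
for (s in unique(all_listes$sexe)) {
  subset_s <- all_listes |> filter(sexe == s)
  saveRDS(subset_s, paste0("listes_", s, ".rds"))
  rm(subset_s); gc()
}

## Later, load one subset at a time and block on date_naissance only.
## "M" is a placeholder for whatever codes the sexe variable uses.
male_candidats <- all_candidats_results |> filter(sexe == "M")
male_listes    <- readRDS("listes_M.rds")
block_male <- blockData(dfA = male_candidats,
                        dfB = male_listes,
                        varnames = "date_naissance",
                        n.cores = 8)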

Keep us posted!

Ted

itsmevictor commented 1 year ago

@aalexandersson Many thanks for your quick response.

  1. They are both in Date format, like this "YYYY-MM-DD". Out of curiosity, I tried converting them to character and then rerunning the blocking algorithm, like this:
sample_listes <- all_listes |> 
  sample_n(3000000) |> 
  mutate(date_naissance = as.character(date_naissance))

all_candidats_results <- all_candidats_results |> 
  mutate(date_naissance = as.character(date_naissance))

(sample_listes |> sample_n(1))$date_naissance
[1] "1992-09-28"
(all_candidats_results |> sample_n(1))$date_naissance
[1] "1972-04-28"

which shows that, after converting to character, the date format remains the same between the two dataframes. I tried running the same blocking algorithm, like this:

block_out <- blockData(dfA = all_candidats_results,
                       dfB = sample_listes,
                       varnames = c("sexe", "date_naissance"),
                       n.cores = 8)

However, this ran for more than an hour, so I just gave up, since running it with the Date format was much faster. I interpret this as suggesting that blocking on dates in Date format is more efficient than on characters (hoping I'm not wrong?).

  2. You might be correct that the problem lies (at least partially) in the number of blocks. If I run the same code as above on 2M observations with the dates in Date format (which works and runs in about 7 minutes; I only get the warning, not the error), I end up with 1928 blocks. The longest element within a block is about 50 individuals. If the problem is indeed that there are too many blocks and I should aim for fewer, one solution would be to convert the birth dates into a number of days (from then until today), and then run a numeric block allowing for a window, iterating until I find a window that is acceptable (i.e. not too large, so that fastLink still works smoothly, but large enough that I can run my blocking operation). What do you think?

  3. The error & warning come after the blocking, not after the linkage.

I just saw @tedenamorado's answer (thank you for getting back to me!), so I'm going to try that too, and I will let you know.

aalexandersson commented 1 year ago

Yes, in my experience, 1928 blocks would be way too many for ~2000 rows. I suggested a maximum of 10 blocks. In general, more blocking will result in a faster linkage, but at the possible expense of more missed matches (false negatives).

Make sure you do not have more blocks than the smallest block size. Otherwise, you will get an error once you get to the linkage step and loop over the blocks. For an example of a linkage that loops over the blocks, see my comment in issue 63.
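
For reference, a hedged sketch of what such a loop can look like, using the block.i$dfA.inds / block.i$dfB.inds structure that blockData() returns per the fastLink README; the linkage variables passed to fastLink() below are placeholders, not the ones from issue 63:

results <- list()

for (i in seq_along(block_out)) {
  ## Subset both dataframes to the rows that fall in block i.
  dfA_i <- all_candidats_results[block_out[[i]]$dfA.inds, ]
  dfB_i <- sample_listes[block_out[[i]]$dfB.inds, ]

  ## Run the linkage within the block; varnames here are placeholders.
  ## Very small blocks may need special handling or merging with neighbours.
  fl_i <- fastLink(dfA = dfA_i, dfB = dfB_i,
                   varnames = c("nom", "prenom"),
                   stringdist.match = c("nom", "prenom"),
                   n.cores = 8)

  ## Keep the matched pairs for this block.
  results[[i]] <- getMatches(dfA = dfA_i, dfB = dfB_i, fl.out = fl_i)
}

## Stack the per-block results (assumes identical columns across blocks).
matched <- do.call(rbind, results)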

itsmevictor commented 1 year ago

@tedenamorado Your suggestion is quite good, but unfortunately it did not entirely solve the issue. I subsetted the 45M-row dataframe by the two possible values of my sex variable, and then took a 5M-row sample of the "male" subset. I then ran:

block_male <- blockData(dfA = all_candidats_results,
                        dfB = male_sample,
                        varnames = "date_naissance",
                        n.cores = 8) 

and after letting it run for an hour, I had to leave, so I interrupted it. I will try again later, but is that the expected timeline? In other words, did I make a mistake somewhere, or is more than an hour for such a block normal? With a 3M-row sample, I think it ran in 5 minutes, and the warning message I got was sparse->dense coercion: allocating vector of size 20.9 GiB, which is still better than when I blocked on gender & birth date, since that did not run at all. Nonetheless, and leaving aside the time issue just mentioned, I fear that I am not going to be able to extend this approach to two ~22M-row dataframes (one for males, one for females). I'm all ears if you have any suggestions, and of course, I thank you again for your help!

@aalexandersson I take good note of your comments and of your reference to your own code (which looks great & will most likely be of great help when I get to that step). I will thus try to get fewer blocks by increasing the size of each of them, by transforming my dates into days and then using window.block and window.size. However, I do not know whether this will help solve the vector memory issue that arises at the blockData step. I'm going to try and will let both of you know.
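
A possible sketch of that windowed-blocking idea, assuming date_naissance is of class Date in both dataframes; the variable name jours_naissance is made up and the 30-day window is just an arbitrary starting value to iterate on:

## Convert birth dates to a numeric day count so blockData() can
## window on them.
all_candidats_results$jours_naissance <- as.numeric(all_candidats_results$date_naissance)
male_sample$jours_naissance <- as.numeric(male_sample$date_naissance)

## Block on the numeric variable with a +/- 30-day window.
block_out <- blockData(dfA = all_candidats_results,
                       dfB = male_sample,
                       varnames = "jours_naissance",
                       window.block = "jours_naissance",
                       window.size = 30,
                       n.cores = 8)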

aalexandersson commented 1 year ago

> I will thus try to get fewer blocks by increasing the size of each of them,

No, only increase the size of the small samples. You probably want to decrease the size of the large samples instead, since the large samples are the most memory consuming. If a 3M-row sample runs in 5 minutes and a 5M-row sample takes over an hour to block on, then a 3M-row sample sounds like a reasonable upper size limit to me.

itsmevictor commented 1 year ago

I'm not sure we are talking about the same thing. When I said I would aim to increase the size of the blocks, I was talking about the number of rows in each df.inds, which, unless I'm mistaken, is determined by how restrictive the blocking condition is; that does not appear to be what you are talking about. From my understanding, if instead of blocking on each particular date I block on "month" (for instance), I will get fewer blocks – am I correct?
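
If it helps, a minimal sketch of that coarser blocking, assuming date_naissance is still of class Date; the derived variable name annee_mois is made up:

## Derive a coarser blocking variable: year and month of birth.
all_candidats_results$annee_mois <- format(all_candidats_results$date_naissance, "%Y-%m")
sample_listes$annee_mois <- format(sample_listes$date_naissance, "%Y-%m")

## Exact blocking on the coarser variable yields far fewer blocks
## than blocking on the full date.
block_out <- blockData(dfA = all_candidats_results,
                       dfB = sample_listes,
                       varnames = c("sexe", "annee_mois"),
                       n.cores = 8)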

However, and more directly related to your comment, I am not sure I see how I could practically run things by decreasing the size of the samples. The way I see it, regardless of how restrictive my blocking condition is, if I choose to block sample by sample, I would first have to run the blocking algorithm $n$ times, where $n$ is the number of samples of my original 45M-row dataframe, which would yield $n$ lists of blocks. Then, for each block $i$ made up of a pair of dataframes $j_1, j_2$ (which are extracts of the inputs dfA and dfB) within each list, I would have to run the fastLink algorithm, before combining everything to get the matches. That seems a little tedious, but maybe I misunderstood your point, or am just stupid?

:-)

tedenamorado commented 1 year ago

Hi,

One possibility would be to remove from the larger dataset all the dates that do not appear in the smaller one (and the other way around). That way, you avoid creating blocks where observations exist in only one dataset and not the other.
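
A minimal sketch of that filtering step, assuming dplyr and the object names used earlier in the thread:

library(dplyr)

## Drop from the large dataset the birth dates that never appear in the
## small one, and vice versa, so every remaining block has rows on both sides.
sample_listes <- sample_listes |>
  filter(date_naissance %in% all_candidats_results$date_naissance)

all_candidats_results <- all_candidats_results |>
  filter(date_naissance %in% sample_listes$date_naissance)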

All my best,

Ted

aalexandersson commented 1 year ago

Ted's "filter" approach can also be used for larger units, such as year instead of date (if some years appear in only one dataset and you want fewer blocks than when using dates). Also, remove any missing values from the blocking variables if applicable.

itsmevictor commented 1 year ago

My apologies for only getting back to you now, when you have both been of such great help!

Basically, @tedenamorado, your solution was spot-on and worked perfectly - I don't know why I did not think of it sooner. Filtering the large dataset to exclude all the birth dates that were not in the small dataset reduced its size considerably and made everything easier. Thank you so much, both for developing the package and for being so available to help.

@aalexandersson Thank you too for all your comments and suggestions. Most importantly, the code you pointed me to (in a different issue) worked great, and I was able to run it without any issues (after a few minor modifications), including for blocks that were 1200-1500 elements long.