Closed by kasparsd 7 years ago
Splitting WXR files is typically done to make them processable by the Importer (v1), but it isn't necessary at all in v2 because the processing method has completely changed. Splitting the files fundamentally changes them, and I'm not sure whether this is a use case we should support. Potentially, a tool to recombine the files could solve this.
Thanks for the feedback @rmccue! Has it been tested with 600,000 posts in one WXR file, or a similarly large number? I'm concerned about the memory requirements for such large imports.
We've tested on multi-gigabyte import files, and memory usage has stayed constant. The biggest struggles with large files are just keeping WordPress itself under control memory-wise.
In any case, I'd encourage trying it and finding out. The reason for v2 of the Importer is that I had a (relatively small) 40MB import taking >512MB of memory and I got sick of it. Being able to handle huge imports is a key design feature, due to the nature of the work we (HM) do.
(To be clear: multi-gigabyte WXR files, not including static files, which can be many multiples more.)
Sounds great @rmccue! So this particular issue relates only to the split export files that are currently used by WordPress VIP, for example. I'm closing this since it's not a common use-case. Just wanted to make sure that it has been noted somewhere.
`post_process_posts()` applies only to the posts in the WXR file that is currently being imported. However, larger CLI-based imports normally contain several WXR files (with around 1000 posts each), and posts in later batches that re-use attachments from earlier batches won't get their URLs replaced.

With imports of 60,000+ attachments, I've done the replacement after the full import has completed and all the new attachment URLs are known. I'm not sure how to do this reliably for everyone, though.
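
For illustration, the "replace after the full import" approach could look roughly like the sketch below. It is written in Python rather than the importer's PHP, and everything in it is an assumption made for the example: the function names (`build_url_map`, `replace_urls`, `remap_all_posts`), the per-batch dictionaries of old to new attachment URLs, and the in-memory `posts` dictionary are all hypothetical and are not part of the Importer. The idea is only that the URL map is merged across every batch first, and the replacement runs once at the end, so a post from a later file can still pick up an attachment imported from an earlier one.

```python
import re


def build_url_map(batches):
    """Merge the old -> new attachment URL pairs recorded for every batch."""
    url_map = {}
    for batch in batches:
        url_map.update(batch)
    return url_map


def replace_urls(content, url_map):
    """Replace every known old URL in content with its new URL in one pass."""
    if not url_map:
        return content
    # Try longer URLs first so a URL that is a prefix of another one
    # (a.jpg vs a.jpg.webp) does not match too early.
    ordered = sorted(url_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(url) for url in ordered))
    return pattern.sub(lambda match: url_map[match.group(0)], content)


def remap_all_posts(posts, url_map):
    """Apply the merged URL map to every imported post body."""
    return {post_id: replace_urls(body, url_map) for post_id, body in posts.items()}


# Hypothetical data: two batches, each with its own attachment mapping.
batch_1 = {"http://old.example/uploads/a.jpg": "https://new.example/uploads/2017/a.jpg"}
batch_2 = {"http://old.example/uploads/b.jpg": "https://new.example/uploads/2017/b.jpg"}

# Post 2050 came from the second file but re-uses an attachment from the first.
posts = {
    101: '<img src="http://old.example/uploads/a.jpg">',
    2050: '<img src="http://old.example/uploads/a.jpg"> and <img src="http://old.example/uploads/b.jpg">',
}

url_map = build_url_map([batch_1, batch_2])
fixed = remap_all_posts(posts, url_map)
```

A single pass over the merged map also avoids rewriting text that an earlier replacement already produced. This sketch does not cover resized image variants or serialized data, which a real import would need to handle separately.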