humanmade / WordPress-Importer

In-development rewrite of the WordPress (WXR) Importer

Fails to replace attachment URLs in the post content of posts outside the current import file #102

Closed kasparsd closed 7 years ago

kasparsd commented 7 years ago

post_process_posts() applies only to the posts in the WXR file that is currently being imported. However, larger CLI-based imports normally span several WXR files (with around 1000 posts each), and posts in later batches that reuse attachments from earlier batches won't get their URLs replaced.

With imports of 60,000+ attachments I've done the URL replacement as a separate pass after the full import has completed and all the new attachment URLs are known. I'm not sure how to do this reliably for everyone, though.
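A minimal sketch of that kind of post-import pass, written in Python purely for illustration (the Importer itself is PHP, and none of this is its code). It assumes each batch produces a dictionary mapping old attachment URLs to new ones; the function names `build_url_map` and `replace_attachment_urls` are hypothetical.

```python
import re


def build_url_map(batches):
    """Merge per-batch {old_url: new_url} dictionaries into one mapping.

    Assumption: every imported WXR file yields such a dictionary.
    """
    url_map = {}
    for batch in batches:
        url_map.update(batch)
    return url_map


def replace_attachment_urls(content, url_map):
    """Rewrite every known old attachment URL found in a post's content."""
    if not url_map:
        return content
    # Try longer URLs first so a URL that is a prefix of another
    # one does not shadow the longer match.
    keys = sorted(url_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: url_map[match.group(0)], content)
```

Because the mapping is only applied once every batch has been imported, a post in a later file can reference an attachment that was created while importing an earlier one.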

rmccue commented 7 years ago

Splitting WXR files is typically done to make them processable by the Importer (v1); however, it isn't necessary at all in v2 because the processing method has changed completely. Splitting the files fundamentally changes them, and I'm not sure whether this is a use case we should support. Potentially, a tool to recombine the files could solve this.

kasparsd commented 7 years ago

Thanks for the feedback @rmccue! Has it been tested with 600,000 posts in one WXR file, or a similarly large number? I'm concerned about the memory requirements for such large imports.

rmccue commented 7 years ago

We've tested on multi-gigabyte import files, and memory usage has stayed constant. The biggest struggles with large files are just keeping WordPress itself under control memory-wise.

In any case, I'd encourage trying it and finding out. The reason for v2 of the Importer is that I had a (relatively small) 40MB import taking >512MB of memory and I got sick of it. Being able to handle huge imports is a key design feature, due to the nature of the work we (HM) do.
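To illustrate how a parser can keep memory roughly flat regardless of file size, here is a small Python sketch of streaming XML processing. It is not the Importer's implementation (which is PHP and not shown in this thread); the function name `iter_item_titles` is hypothetical, and the only assumption about the input is the usual RSS layout of `<item>` elements inside a `<channel>`.

```python
import xml.etree.ElementTree as ET


def iter_item_titles(source):
    """Yield the title of each <item>, discarding parsed items as it goes."""
    channel = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "channel":
            # Remember the container so finished items can be dropped from it.
            channel = elem
        elif event == "end" and elem.tag == "item":
            title = elem.findtext("title", default="")
            yield title
            if channel is not None:
                # Drop everything parsed so far under <channel>, so the
                # in-memory tree does not grow with the number of items.
                channel.clear()
```

The key design choice is that each item is handled and released before the next one is read, rather than loading the whole document into memory first.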

rmccue commented 7 years ago

(To be clear: multi-gigabyte WXR files, not including static files, which can be many multiples more.)

kasparsd commented 7 years ago

Sounds great @rmccue! So this particular issue relates only to split export files, such as those currently used by WordPress VIP. I'm closing this since it's not a common use case. Just wanted to make sure that it has been noted somewhere.