open2c / pairtools

Extract 3D contacts (.pairs) from sequencing alignments
MIT License
104 stars 32 forks source link

Dedup with UMIs #186

Closed gdolsten closed 1 year ago

gdolsten commented 1 year ago

--extra-col-pair Extra columns that also must match for two pairs to be marked as duplicates. Can be either provided as 0-based column indices or as column names (requires the “#columns” header field). The option can be provided multiple times if multiple column pairs must match. Example: –extra-col-pair “phase1” “phase2”. [output format option]

If I have added UMIs as a column, can I pass this into the extra-col-pair? I.e., pairtools dedup ... --extra-col-pair 8 ...

Phlya commented 1 year ago

I think it should work! You just have to try... Probably using integer UMI IDs would be faster than strings though.

gdolsten commented 1 year ago

Great, thanks. Do you also know what the upper/lower case here means?

My pairtools QC returns these different types of read pairs. Do you know what the upper/lower case from pairtools stats means?

Index(['total', 'total_unmapped', 'total_single_sided_mapped', 'total_mapped', 'total_dups', 'total_nodups', 'cis', 'trans', 'pair_types/MM', 'pair_types/Mm', 'pair_types/NM', 'pair_types/NN', 'pair_types/Nm', 'pair_types/Nn', 'pair_types/mM', 'pair_types/mm', 'pair_types/nM', 'pair_types/nN', 'pair_types/nm', 'pair_types/nn', 'pair_types/Mu', 'pair_types/MU', 'pair_types/MR', 'pair_types/NR', 'pair_types/NU', 'pair_types/nU', 'pair_types/mU', 'pair_types/Nu', 'pair_types/mu', 'pair_types/nu', 'pair_types/UU', 'pair_types/DD', 'pair_types/UR', 'pair_types/uu', 'pair_types/RU', 'pair_types/Uu', 'pair_types/uU',]

gdolsten commented 1 year ago

Hi, just wanted to ping this!

Phlya commented 1 year ago

I think lower case types come from rescued walks, is that correct @agalitsyna ?

gdolsten commented 1 year ago

I thought that "R" stood for rescued.

agalitsyna commented 1 year ago

Lower letters mean the chimeric reads, where the read continuity on the genome has been disrupted. Lower letter types will be reported for regimes "5any", "5unique", "3any", and "3unique" (but not "mask" or "all"; it also applies for the pairtools >=1.0.2, for much earlier versions, there was no "all" regime). In these regimes, U alignment can be reported as chimeric. It does not make much difference to treat lower-letter pair types differently in the multiqc report.

gdolsten commented 1 year ago

I see, so if we get a complex "walk" then with my current 5unique walks policy we will report all of the consecutive ligation junctions. Since each of these are UNIQUE they will get a U, but it will be lower case since it is from a walk.

gdolsten commented 1 year ago

The tutorial code has the following pairtools select line:

pairtools select --nproc-in {threads} --nproc-out {threads} -o {output} '(pair_type=="UU") or (pair_type=="UR") or (pair_type=="RU")' {input}

Do you recommend modifying this to include uu, uR, Ru, Ur, etc?

gdolsten commented 1 year ago

@agalitsyna do you have any thoughts on the above ?

agalitsyna commented 1 year ago

I'm not entirely sure what will be the best case for you. The decision shall be based on (1) how large is the pairs file, (2) what are the advantages of using 5unique mode (maybe you can switch to some other more popular option of pairtools, if there are no other advantages; all is currently the best-recommended option as it recovers the most Hi-C/uC contacts), and (3) how much more in the future you anticipate using this particular mode on large files.

A. If the file is small, I would modify the pairtools select command to include all lowercase combinations (I would run awk '{print $8}' | uniq first to get an idea of what combinations are relevant). I like this solution as it will record your action in the .pairs header, and you can run this command in your pipelines again.

B. If your pipeline is more complicated and you anticipate running the statistics with different tools, I recommend re-generating the file with forced upper-case for the 8th column with the pair types. Then you can use this file with any tools working on .pairs file. The example command: grep -v '^#' | awk '{ print $1, $2, $3, $4, $5, $6, $7, toupper($8) }' (modify to have the number of columns that you have in your .pairs file; also the header shall be included independently at the beginning of the file). _I don't like this solution as it forces you to do non-standard manipulation on the file that will make your pipeline more dirty. Also, this manipulation won't appear in the header of the .pairs file as part of the history tracking that we implemented in pairtools._

C. If you think you have an important use case of pairtools and your pipeline is losing performance due to lower-case encoding of pair types, we can think of implementing an option to switch this lower-case mode off. I personally like this idea the most, as it makes pairtools better suited for manipulating multiple data formats. However, it also increases the number of potentially unnecessary parameters and generally increases the support management in the future.

@gdolsten Can you let me know your thinking and how much you will benefit from any of these solutions?

agalitsyna commented 1 year ago

Hi, I hope this helps and your issue got resolved! Feel free to re-open if you have suggestions or related issues.

gorliver commented 9 months ago

I'm not entirely sure what will be the best case for you. The decision shall be based on (1) how large is the pairs file, (2) what are the advantages of using 5unique mode (maybe you can switch to some other more popular option of pairtools, if there are no other advantages; all is currently the best-recommended option as it recovers the most Hi-C/uC contacts), and (3) how much more in the future you anticipate using this particular mode on large files.

A. If the file is small, I would modify the pairtools select command to include all lowercase combinations (I would run awk '{print $8}' | uniq first to get an idea of what combinations are relevant). I like this solution as it will record your action in the .pairs header, and you can run this command in your pipelines again.

B. If your pipeline is more complicated and you anticipate running the statistics with different tools, I recommend re-generating the file with forced upper-case for the 8th column with the pair types. Then you can use this file with any tools working on .pairs file. The example command: grep -v '^#' | awk '{ print $1, $2, $3, $4, $5, $6, $7, toupper($8) }' (modify to have the number of columns that you have in your .pairs file; also the header shall be included independently at the beginning of the file). _I don't like this solution as it forces you to do non-standard manipulation on the file that will make your pipeline more dirty. Also, this manipulation won't appear in the header of the .pairs file as part of the history tracking that we implemented in pairtools._

C. If you think you have an important use case of pairtools and your pipeline is losing performance due to lower-case encoding of pair types, we can think of implementing an option to switch this lower-case mode off. I personally like this idea the most, as it makes pairtools better suited for manipulating multiple data formats. However, it also increases the number of potentially unnecessary parameters and generally increases the support management in the future.

@gdolsten Can you let me know your thinking and how much you will benefit from any of these solutions?

Hi @agalisyna, I found using 5unique actually generates more valid pairs than all in my data, likely due to short fragments during the library construction and both reads of a pair contain ligation junctions. Here is an example of two reads:

350210693L1C001R00100000816    65      Chr3    59259905        60      42M58S  =       155481359       96221455        GCACACAGTCATTCCCGTCCCACCTTCGAAATCGCCGTCCGCGCTTCGGCCGAGTGCCGAAGGCTGCTGTCATCAGCTGGTCCTCTTTTTTGGAGTAATT    HHEFFFEGCFECCFFCHBFGHDFHADHGADECFF;GHBHHGCHD@CHB=FHD-H:GFF9EAGHBAGF=EAFBACECADGF?:FC@C@ACBCG=AGAEEBA    NM:i:0  MD:Z:42 MC:Z:81M19S     AS:i:42 XS:i:29 SA:Z:Chr2,207068753,+,41S59M,0,0;
V350210693L1C001R00100000816    2113    Chr2    207068753       0       41H59M  Chr3    155481359       0       CGCTTCGGCCGAGTGCCGAAGGCTGCTGTCATCAGCTGGTCCTCTTTTTTGGAGTAATT     CHD@CHB=FHD-H:GFF9EAGHBAGF=EAFBACECADGF?:FC@C@ACBCG=AGAEEBA     NM:i:0  MD:Z:59 MC:Z:81M19H     AS:i:59 XS:i:59 SA:Z:Chr3,59259905,+,42M58S,60,0;
V350210693L1C001R00100000816    129     Chr3    155481359       60      81M19S  =       59259905        -96221455       AGAGCAGGGATGGCAGAGGAGCAGTCCCAAAACCGGTTGAACCGGTTTCCGGACCGGTTGAACCGATTTCCAGGGCTGAAATTACTCCAAAAAAGAGGAC    CF9HGBFHA;>G@>/G#EF;GF;H=GBCC9==GDEF3?ED/EHH7<B>FD?A'F:<E,@G'$@EG8>.A4E)<CH58G#E"@@*E?3FAE#.B.C=)FCC    NM:i:0  MD:Z:81 MC:Z:42M58S     AS:i:81 XS:i:50 SA:Z:Chr1,67516115,+,77S23M,0,0;
V350210693L1C001R00100000816    2177    Chr1    67516115        0       77H23M  Chr3    59259905        0       GAAATTACTCCAAAAAAGAGGAC G#E"@@*E?3FAE#.B.C=)FCC NM:i:0  MD:Z:23 MC:Z:42M58H     AS:i:23 XS:i:23 SA:Z:Chr3,155481359,+,81M19S,60,0;
V350210693L1C001R00100000903    65      Chr3    27240151        50      56M44S  =       25648197        -1591955        CTTTCTTATTACTAAGGTTTGGTATTTACTTACACTTAATGACTGCTAATAAAATTGTGTGTTGACGTTTGACAGAAATTGGGAGTCCAGTTTGCCATAT    DCB;G9@BABDHAECBGDBB?E7=C<BDGB?8GBGBD=9@G;FC=H?DCCED9DBA??HCF?C<ECG<77HAGFHCDDB=A;1-HAGG1?8A1GGDAD?D    NM:i:0  MD:Z:56 MC:Z:100M       AS:i:56 XS:i:41 SA:Z:Chr1,21357395,-,45M55S,0,0;
V350210693L1C001R00100000903    2129    Chr1    21357395        0       45M55H  Chr3    25648197        0       ATATGGCAAACTGGACTCCCAATTTCTGTCAAACGTCAACACACA   D?DADGG1A8?1GGAH-1;A=BDDCHFGAH77<GCE<C?FCH??A   NM:i:0  MD:Z:45 MC:Z:100M       AS:i:45 XS:i:45 SA:Z:Chr3,27240151,+,56M44S,50,0;
V350210693L1C001R00100000903    129     Chr3    25648197        51      100M    =       27240151        1591955 CATATAATGCTTCAAACGGTGACATCTTGATACTGGCCTAGTAGATGTTGTTATACGAGAACTCTGCATATGGCAAACTGGACTCCCAATTTCTGTCAAA    ED?D?D5<GEBAHEE1FHF=C2DC@>DBG6?AHCGHEHC@FC699?GD5F<A>C=6G/HB@<CACH:A?C9G==>B8GAE>B<?HEGE)C37?CHC7+E)    NM:i:0  MD:Z:100        MC:Z:56M44S     AS:i:100        XS:i:80

Using all, I got:

V350210693L1C001R00100000816    Chr3    59259905        !       0       +       +       UM      V350210693L1C001R0010000081665Chr3592599056042M58S=15548135996221455GCACACAGTCATTCCCGTCCCACCTTCGAAATCGCCGTCCGCGCTTCGGCCGAGTGCCGAAGGCTGCTGTCATCAGCTGGTCCTCTTTTTTGGAGTAATTHHEFFFEGCFECCFFCHBFGHDFHADHGADECFF;GHBHHGCHD@CHB=FHD-H:GFF9EAGHBAGF=EAFBACECADGF?:FC@C@ACBCG=AGAEEBANM:i:0MD:Z:42MC:Z:81M19SAS:i:42XS:i:29SA:Z:Chr2,207068753,+,41S59M,0,0;Yt:Z:UMNEXT_SAMV350210693L1C001R001000008162113Chr2207068753041H59MChr31554813590CGCTTCGGCCGAGTGCCGAAGGCTGCTGTCATCAGCTGGTCCTCTTTTTTGGAGTAATTCHD@CHB=FHD-H:GFF9EAGHBAGF=EAFBACECADGF?:FC@C@ACBCG=AGAEEBANM:i:0MD:Z:59MC:Z:81M19HAS:i:59XS:i:59SA:Z:Chr3,59259905,+,42M58S,60,0;Yt:Z:UM        V350210693L1C001R00100000816129Chr31554813596081M19S=59259905-96221455AGAGCAGGGATGGCAGAGGAGCAGTCCCAAAACCGGTTGAACCGGTTTCCGGACCGGTTGAACCGATTTCCAGGGCTGAAATTACTCCAAAAAAGAGGACCF9HGBFHA;>G@>/G#EF;GF;H=GBCC9==GDEF3?ED/EHH7<B>FD?A'F:<E,@G'$@EG8>.A4E)<CH58G#E"@@*E?3FAE#.B.C=)FCCNM:i:0MD:Z:81MC:Z:42M58SAS:i:81XS:i:50SA:Z:Chr1,67516115,+,77S23M,0,0;Yt:Z:UMNEXT_SAMV350210693L1C001R001000008162177Chr167516115077H23MChr3592599050GAAATTACTCCAAAAAAGAGGACG#E"@@*E?3FAE#.B.C=)FCCNM:i:0MD:Z:23MC:Z:42M58HAS:i:23XS:i:23SA:Z:Chr3,155481359,+,81M19S,60,0;Yt:Z:UM        60      0
V350210693L1C001R00100000816    !       0       Chr3    155481359       +       +       MU      V350210693L1C001R0010000081665Chr3592599056042M58S=15548135996221455GCACACAGTCATTCCCGTCCCACCTTCGAAATCGCCGTCCGCGCTTCGGCCGAGTGCCGAAGGCTGCTGTCATCAGCTGGTCCTCTTTTTTGGAGTAATTHHEFFFEGCFECCFFCHBFGHDFHADHGADECFF;GHBHHGCHD@CHB=FHD-H:GFF9EAGHBAGF=EAFBACECADGF?:FC@C@ACBCG=AGAEEBANM:i:0MD:Z:42MC:Z:81M19SAS:i:42XS:i:29SA:Z:Chr2,207068753,+,41S59M,0,0;Yt:Z:MUNEXT_SAMV350210693L1C001R001000008162113Chr2207068753041H59MChr31554813590CGCTTCGGCCGAGTGCCGAAGGCTGCTGTCATCAGCTGGTCCTCTTTTTTGGAGTAATTCHD@CHB=FHD-H:GFF9EAGHBAGF=EAFBACECADGF?:FC@C@ACBCG=AGAEEBANM:i:0MD:Z:59MC:Z:81M19HAS:i:59XS:i:59SA:Z:Chr3,59259905,+,42M58S,60,0;Yt:Z:MU        V350210693L1C001R00100000816129Chr31554813596081M19S=59259905-96221455AGAGCAGGGATGGCAGAGGAGCAGTCCCAAAACCGGTTGAACCGGTTTCCGGACCGGTTGAACCGATTTCCAGGGCTGAAATTACTCCAAAAAAGAGGACCF9HGBFHA;>G@>/G#EF;GF;H=GBCC9==GDEF3?ED/EHH7<B>FD?A'F:<E,@G'$@EG8>.A4E)<CH58G#E"@@*E?3FAE#.B.C=)FCCNM:i:0MD:Z:81MC:Z:42M58SAS:i:81XS:i:50SA:Z:Chr1,67516115,+,77S23M,0,0;Yt:Z:MUNEXT_SAMV350210693L1C001R001000008162177Chr167516115077H23MChr3592599050GAAATTACTCCAAAAAAGAGGACG#E"@@*E?3FAE#.B.C=)FCCNM:i:0MD:Z:23MC:Z:42M58HAS:i:23XS:i:23SA:Z:Chr3,155481359,+,81M19S,60,0;Yt:Z:MU        0       60
V350210693L1C001R00100000903    Chr3    27240151        !       0       +       +       UM      V350210693L1C001R0010000090365Chr3272401515056M44S=25648197-1591955CTTTCTTATTACTAAGGTTTGGTATTTACTTACACTTAATGACTGCTAATAAAATTGTGTGTTGACGTTTGACAGAAATTGGGAGTCCAGTTTGCCATATDCB;G9@BABDHAECBGDBB?E7=C<BDGB?8GBGBD=9@G;FC=H?DCCED9DBA??HCF?C<ECG<77HAGFHCDDB=A;1-HAGG1?8A1GGDAD?DNM:i:0MD:Z:56MC:Z:100MAS:i:56XS:i:41SA:Z:Chr1,21357395,-,45M55S,0,0;Yt:Z:UMNEXT_SAMV350210693L1C001R001000009032129Chr121357395045M55HChr3256481970ATATGGCAAACTGGACTCCCAATTTCTGTCAAACGTCAACACACAD?DADGG1A8?1GGAH-1;A=BDDCHFGAH77<GCE<C?FCH??ANM:i:0MD:Z:45MC:Z:100MAS:i:45XS:i:45SA:Z:Chr3,27240151,+,56M44S,50,0;Yt:Z:UM    V350210693L1C001R00100000903129Chr32564819751100M=272401511591955CATATAATGCTTCAAACGGTGACATCTTGATACTGGCCTAGTAGATGTTGTTATACGAGAACTCTGCATATGGCAAACTGGACTCCCAATTTCTGTCAAAED?D?D5<GEBAHEE1FHF=C2DC@>DBG6?AHCGHEHC@FC699?GD5F<A>C=6G/HB@<CACH:A?C9G==>B8GAE>B<?HEGE)C37?CHC7+E)NM:i:0MD:Z:100MC:Z:56M44SAS:i:100XS:i:80Yt:Z:UM        50      0
V350210693L1C001R00100000903    !       0       Chr3    25648197        -       +       MU      V350210693L1C001R0010000090365Chr3272401515056M44S=25648197-1591955CTTTCTTATTACTAAGGTTTGGTATTTACTTACACTTAATGACTGCTAATAAAATTGTGTGTTGACGTTTGACAGAAATTGGGAGTCCAGTTTGCCATATDCB;G9@BABDHAECBGDBB?E7=C<BDGB?8GBGBD=9@G;FC=H?DCCED9DBA??HCF?C<ECG<77HAGFHCDDB=A;1-HAGG1?8A1GGDAD?DNM:i:0MD:Z:56MC:Z:100MAS:i:56XS:i:41SA:Z:Chr1,21357395,-,45M55S,0,0;Yt:Z:MUNEXT_SAMV350210693L1C001R001000009032129Chr121357395045M55HChr3256481970ATATGGCAAACTGGACTCCCAATTTCTGTCAAACGTCAACACACAD?DADGG1A8?1GGAH-1;A=BDDCHFGAH77<GCE<C?FCH??ANM:i:0MD:Z:45MC:Z:100MAS:i:45XS:i:45SA:Z:Chr3,27240151,+,56M44S,50,0;Yt:Z:MU    V350210693L1C001R00100000903129Chr32564819751100M=272401511591955CATATAATGCTTCAAACGGTGACATCTTGATACTGGCCTAGTAGATGTTGTTATACGAGAACTCTGCATATGGCAAACTGGACTCCCAATTTCTGTCAAAED?D?D5<GEBAHEE1FHF=C2DC@>DBG6?AHCGHEHC@FC699?GD5F<A>C=6G/HB@<CACH:A?C9G==>B8GAE>B<?HEGE)C37?CHC7+E)NM:i:0MD:Z:100MC:Z:56M44SAS:i:100XS:i:80Yt:Z:MU        0       51

And using 5unique, I got:

V350210693L1C001R00100000816    Chr3    59259905        Chr3    155481359       +       +       uu      V350210693L1C001R0010000081665Chr3592599056042M58S=15548135996221455GCACACAGTCATTCCCGTCCCACCTTCGAAATCGCCGTCCGCGCTTCGGCCGAGTGCCGAAGGCTGCTGTCATCAGCTGGTCCTCTTTTTTGGAGTAATTHHEFFFEGCFECCFFCHBFGHDFHADHGADECFF;GHBHHGCHD@CHB=FHD-H:GFF9EAGHBAGF=EAFBACECADGF?:FC@C@ACBCG=AGAEEBANM:i:0MD:Z:42MC:Z:81M19SAS:i:42XS:i:29SA:Z:Chr2,207068753,+,41S59M,0,0;Yt:Z:uuNEXT_SAMV350210693L1C001R001000008162113Chr2207068753041H59MChr31554813590CGCTTCGGCCGAGTGCCGAAGGCTGCTGTCATCAGCTGGTCCTCTTTTTTGGAGTAATTCHD@CHB=FHD-H:GFF9EAGHBAGF=EAFBACECADGF?:FC@C@ACBCG=AGAEEBANM:i:0MD:Z:59MC:Z:81M19HAS:i:59XS:i:59SA:Z:Chr3,59259905,+,42M58S,60,0;Yt:Z:uu        V350210693L1C001R00100000816129Chr31554813596081M19S=59259905-96221455AGAGCAGGGATGGCAGAGGAGCAGTCCCAAAACCGGTTGAACCGGTTTCCGGACCGGTTGAACCGATTTCCAGGGCTGAAATTACTCCAAAAAAGAGGACCF9HGBFHA;>G@>/G#EF;GF;H=GBCC9==GDEF3?ED/EHH7<B>FD?A'F:<E,@G'$@EG8>.A4E)<CH58G#E"@@*E?3FAE#.B.C=)FCCNM:i:0MD:Z:81MC:Z:42M58SAS:i:81XS:i:50SA:Z:Chr1,67516115,+,77S23M,0,0;Yt:Z:uuNEXT_SAMV350210693L1C001R001000008162177Chr167516115077H23MChr3592599050GAAATTACTCCAAAAAAGAGGACG#E"@@*E?3FAE#.B.C=)FCCNM:i:0MD:Z:23MC:Z:42M58HAS:i:23XS:i:23SA:Z:Chr3,155481359,+,81M19S,60,0;Yt:Z:uu        60      60
V350210693L1C001R00100000903    Chr3    25648197        Chr3    27240151        +       +       Uu      V350210693L1C001R00100000903129Chr32564819751100M=272401511591955CATATAATGCTTCAAACGGTGACATCTTGATACTGGCCTAGTAGATGTTGTTATACGAGAACTCTGCATATGGCAAACTGGACTCCCAATTTCTGTCAAAED?D?D5<GEBAHEE1FHF=C2DC@>DBG6?AHCGHEHC@FC699?GD5F<A>C=6G/HB@<CACH:A?C9G==>B8GAE>B<?HEGE)C37?CHC7+E)NM:i:0MD:Z:100MC:Z:56M44SAS:i:100XS:i:80Yt:Z:Uu        V350210693L1C001R0010000090365Chr3272401515056M44S=25648197-1591955CTTTCTTATTACTAAGGTTTGGTATTTACTTACACTTAATGACTGCTAATAAAATTGTGTGTTGACGTTTGACAGAAATTGGGAGTCCAGTTTGCCATATDCB;G9@BABDHAECBGDBB?E7=C<BDGB?8GBGBD=9@G;FC=H?DCCED9DBA??HCF?C<ECG<77HAGFHCDDB=A;1-HAGG1?8A1GGDAD?DNM:i:0MD:Z:56MC:Z:100MAS:i:56XS:i:41SA:Z:Chr1,21357395,-,45M55S,0,0;Yt:Z:UuNEXT_SAMV350210693L1C001R001000009032129Chr121357395045M55HChr3256481970ATATGGCAAACTGGACTCCCAATTTCTGTCAAACGTCAACACACAD?DADGG1A8?1GGAH-1;A=BDDCHFGAH77<GCE<C?FCH??ANM:i:0MD:Z:45MC:Z:100MAS:i:45XS:i:45SA:Z:Chr3,27240151,+,56M44S,50,0;Yt:Z:Uu    51      50

I use version 1.0.3. I expected all should contain all the pairs generated by 5unique. May be I missed anything? Do you have any suggestions for this complicated data set?