Closed gdolsten closed 1 year ago
I think it should work! You just have to try... Probably using integer UMI IDs would be faster than strings though.
Great, thanks. Do you also know what the upper/lower case here means?
My pairtools QC returns these different types of read pairs. Do you know what the upper/lower case from pairtools stats means?
Index(['total', 'total_unmapped', 'total_single_sided_mapped', 'total_mapped', 'total_dups', 'total_nodups', 'cis', 'trans', 'pair_types/MM', 'pair_types/Mm', 'pair_types/NM', 'pair_types/NN', 'pair_types/Nm', 'pair_types/Nn', 'pair_types/mM', 'pair_types/mm', 'pair_types/nM', 'pair_types/nN', 'pair_types/nm', 'pair_types/nn', 'pair_types/Mu', 'pair_types/MU', 'pair_types/MR', 'pair_types/NR', 'pair_types/NU', 'pair_types/nU', 'pair_types/mU', 'pair_types/Nu', 'pair_types/mu', 'pair_types/nu', 'pair_types/UU', 'pair_types/DD', 'pair_types/UR', 'pair_types/uu', 'pair_types/RU', 'pair_types/Uu', 'pair_types/uU',]
Hi, just wanted to ping this!
I think lower case types come from rescued walks, is that correct @agalitsyna ?
I thought that "R" stood for rescued.
Lower letters mean the chimeric reads, where the read continuity on the genome has been disrupted. Lower letter types will be reported for regimes "5any", "5unique", "3any", and "3unique" (but not "mask" or "all"; it also applies for the pairtools >=1.0.2, for much earlier versions, there was no "all" regime). In these regimes, U alignment can be reported as chimeric. It does not make much difference to treat lower-letter pair types differently in the multiqc report.
I see, so if we get a complex "walk" then with my current 5unique walks policy we will report all of the consecutive ligation junctions. Since each of these are UNIQUE they will get a U, but it will be lower case since it is from a walk.
The tutorial code has the following pairtools select line:
pairtools select --nproc-in {threads} --nproc-out {threads} -o {output} '(pair_type=="UU") or (pair_type=="UR") or (pair_type=="RU")' {input}
Do you recommend modifying this to include uu, uR, Ru, Ur, etc?
@agalitsyna do you have any thoughts on the above ?
I'm not entirely sure what will be the best case for you. The decision shall be based on (1) how large is the pairs file, (2) what are the advantages of using 5unique mode (maybe you can switch to some other more popular option of pairtools, if there are no other advantages; all
is currently the best-recommended option as it recovers the most Hi-C/uC contacts), and (3) how much more in the future you anticipate using this particular mode on large files.
A. If the file is small, I would modify the pairtools select command to include all lowercase combinations (I would run awk '{print $8}' | uniq
first to get an idea of what combinations are relevant).
I like this solution as it will record your action in the .pairs header, and you can run this command in your pipelines again.
B. If your pipeline is more complicated and you anticipate running the statistics with different tools, I recommend re-generating the file with forced upper-case for the 8th column with the pair types. Then you can use this file with any tools working on .pairs file. The example command: grep -v '^#' | awk '{ print $1, $2, $3, $4, $5, $6, $7, toupper($8) }'
(modify to have the number of columns that you have in your .pairs file; also the header shall be included independently at the beginning of the file).
_I don't like this solution as it forces you to do non-standard manipulation on the file that will make your pipeline more dirty. Also, this manipulation won't appear in the header of the .pairs file as part of the history tracking that we implemented in pairtools._
C. If you think you have an important use case of pairtools and your pipeline is losing performance due to lower-case encoding of pair types, we can think of implementing an option to switch this lower-case mode off. I personally like this idea the most, as it makes pairtools better suited for manipulating multiple data formats. However, it also increases the number of potentially unnecessary parameters and generally increases the support management in the future.
@gdolsten Can you let me know your thinking and how much you will benefit from any of these solutions?
Hi, I hope this helps and your issue got resolved! Feel free to re-open if you have suggestions or related issues.
I'm not entirely sure what will be the best case for you. The decision shall be based on (1) how large is the pairs file, (2) what are the advantages of using 5unique mode (maybe you can switch to some other more popular option of pairtools, if there are no other advantages;
all
is currently the best-recommended option as it recovers the most Hi-C/uC contacts), and (3) how much more in the future you anticipate using this particular mode on large files.A. If the file is small, I would modify the pairtools select command to include all lowercase combinations (I would run
awk '{print $8}' | uniq
first to get an idea of what combinations are relevant). I like this solution as it will record your action in the .pairs header, and you can run this command in your pipelines again.B. If your pipeline is more complicated and you anticipate running the statistics with different tools, I recommend re-generating the file with forced upper-case for the 8th column with the pair types. Then you can use this file with any tools working on .pairs file. The example command:
grep -v '^#' | awk '{ print $1, $2, $3, $4, $5, $6, $7, toupper($8) }'
(modify to have the number of columns that you have in your .pairs file; also the header shall be included independently at the beginning of the file). _I don't like this solution as it forces you to do non-standard manipulation on the file that will make your pipeline more dirty. Also, this manipulation won't appear in the header of the .pairs file as part of the history tracking that we implemented in pairtools._C. If you think you have an important use case of pairtools and your pipeline is losing performance due to lower-case encoding of pair types, we can think of implementing an option to switch this lower-case mode off. I personally like this idea the most, as it makes pairtools better suited for manipulating multiple data formats. However, it also increases the number of potentially unnecessary parameters and generally increases the support management in the future.
@gdolsten Can you let me know your thinking and how much you will benefit from any of these solutions?
Hi @agalisyna,
I found using 5unique
actually generates more valid pairs than all
in my data, likely due to short fragments during the library construction and both reads of a pair contain ligation junctions.
Here is an example of two reads:
350210693L1C001R00100000816 65 Chr3 59259905 60 42M58S = 155481359 96221455 GCACACAGTCATTCCCGTCCCACCTTCGAAATCGCCGTCCGCGCTTCGGCCGAGTGCCGAAGGCTGCTGTCATCAGCTGGTCCTCTTTTTTGGAGTAATT HHEFFFEGCFECCFFCHBFGHDFHADHGADECFF;GHBHHGCHD@CHB=FHD-H:GFF9EAGHBAGF=EAFBACECADGF?:FC@C@ACBCG=AGAEEBA NM:i:0 MD:Z:42 MC:Z:81M19S AS:i:42 XS:i:29 SA:Z:Chr2,207068753,+,41S59M,0,0;
V350210693L1C001R00100000816 2113 Chr2 207068753 0 41H59M Chr3 155481359 0 CGCTTCGGCCGAGTGCCGAAGGCTGCTGTCATCAGCTGGTCCTCTTTTTTGGAGTAATT CHD@CHB=FHD-H:GFF9EAGHBAGF=EAFBACECADGF?:FC@C@ACBCG=AGAEEBA NM:i:0 MD:Z:59 MC:Z:81M19H AS:i:59 XS:i:59 SA:Z:Chr3,59259905,+,42M58S,60,0;
V350210693L1C001R00100000816 129 Chr3 155481359 60 81M19S = 59259905 -96221455 AGAGCAGGGATGGCAGAGGAGCAGTCCCAAAACCGGTTGAACCGGTTTCCGGACCGGTTGAACCGATTTCCAGGGCTGAAATTACTCCAAAAAAGAGGAC CF9HGBFHA;>G@>/G#EF;GF;H=GBCC9==GDEF3?ED/EHH7<B>FD?A'F:<E,@G'$@EG8>.A4E)<CH58G#E"@@*E?3FAE#.B.C=)FCC NM:i:0 MD:Z:81 MC:Z:42M58S AS:i:81 XS:i:50 SA:Z:Chr1,67516115,+,77S23M,0,0;
V350210693L1C001R00100000816 2177 Chr1 67516115 0 77H23M Chr3 59259905 0 GAAATTACTCCAAAAAAGAGGAC G#E"@@*E?3FAE#.B.C=)FCC NM:i:0 MD:Z:23 MC:Z:42M58H AS:i:23 XS:i:23 SA:Z:Chr3,155481359,+,81M19S,60,0;
V350210693L1C001R00100000903 65 Chr3 27240151 50 56M44S = 25648197 -1591955 CTTTCTTATTACTAAGGTTTGGTATTTACTTACACTTAATGACTGCTAATAAAATTGTGTGTTGACGTTTGACAGAAATTGGGAGTCCAGTTTGCCATAT DCB;G9@BABDHAECBGDBB?E7=C<BDGB?8GBGBD=9@G;FC=H?DCCED9DBA??HCF?C<ECG<77HAGFHCDDB=A;1-HAGG1?8A1GGDAD?D NM:i:0 MD:Z:56 MC:Z:100M AS:i:56 XS:i:41 SA:Z:Chr1,21357395,-,45M55S,0,0;
V350210693L1C001R00100000903 2129 Chr1 21357395 0 45M55H Chr3 25648197 0 ATATGGCAAACTGGACTCCCAATTTCTGTCAAACGTCAACACACA D?DADGG1A8?1GGAH-1;A=BDDCHFGAH77<GCE<C?FCH??A NM:i:0 MD:Z:45 MC:Z:100M AS:i:45 XS:i:45 SA:Z:Chr3,27240151,+,56M44S,50,0;
V350210693L1C001R00100000903 129 Chr3 25648197 51 100M = 27240151 1591955 CATATAATGCTTCAAACGGTGACATCTTGATACTGGCCTAGTAGATGTTGTTATACGAGAACTCTGCATATGGCAAACTGGACTCCCAATTTCTGTCAAA ED?D?D5<GEBAHEE1FHF=C2DC@>DBG6?AHCGHEHC@FC699?GD5F<A>C=6G/HB@<CACH:A?C9G==>B8GAE>B<?HEGE)C37?CHC7+E) NM:i:0 MD:Z:100 MC:Z:56M44S AS:i:100 XS:i:80
Using all
, I got:
V350210693L1C001R00100000816 Chr3 59259905 ! 0 + + UM V350210693L1C001R0010000081665Chr3592599056042M58S=15548135996221455GCACACAGTCATTCCCGTCCCACCTTCGAAATCGCCGTCCGCGCTTCGGCCGAGTGCCGAAGGCTGCTGTCATCAGCTGGTCCTCTTTTTTGGAGTAATTHHEFFFEGCFECCFFCHBFGHDFHADHGADECFF;GHBHHGCHD@CHB=FHD-H:GFF9EAGHBAGF=EAFBACECADGF?:FC@C@ACBCG=AGAEEBANM:i:0MD:Z:42MC:Z:81M19SAS:i:42XS:i:29SA:Z:Chr2,207068753,+,41S59M,0,0;Yt:Z:UMNEXT_SAMV350210693L1C001R001000008162113Chr2207068753041H59MChr31554813590CGCTTCGGCCGAGTGCCGAAGGCTGCTGTCATCAGCTGGTCCTCTTTTTTGGAGTAATTCHD@CHB=FHD-H:GFF9EAGHBAGF=EAFBACECADGF?:FC@C@ACBCG=AGAEEBANM:i:0MD:Z:59MC:Z:81M19HAS:i:59XS:i:59SA:Z:Chr3,59259905,+,42M58S,60,0;Yt:Z:UM V350210693L1C001R00100000816129Chr31554813596081M19S=59259905-96221455AGAGCAGGGATGGCAGAGGAGCAGTCCCAAAACCGGTTGAACCGGTTTCCGGACCGGTTGAACCGATTTCCAGGGCTGAAATTACTCCAAAAAAGAGGACCF9HGBFHA;>G@>/G#EF;GF;H=GBCC9==GDEF3?ED/EHH7<B>FD?A'F:<E,@G'$@EG8>.A4E)<CH58G#E"@@*E?3FAE#.B.C=)FCCNM:i:0MD:Z:81MC:Z:42M58SAS:i:81XS:i:50SA:Z:Chr1,67516115,+,77S23M,0,0;Yt:Z:UMNEXT_SAMV350210693L1C001R001000008162177Chr167516115077H23MChr3592599050GAAATTACTCCAAAAAAGAGGACG#E"@@*E?3FAE#.B.C=)FCCNM:i:0MD:Z:23MC:Z:42M58HAS:i:23XS:i:23SA:Z:Chr3,155481359,+,81M19S,60,0;Yt:Z:UM 60 0
V350210693L1C001R00100000816 ! 0 Chr3 155481359 + + MU V350210693L1C001R0010000081665Chr3592599056042M58S=15548135996221455GCACACAGTCATTCCCGTCCCACCTTCGAAATCGCCGTCCGCGCTTCGGCCGAGTGCCGAAGGCTGCTGTCATCAGCTGGTCCTCTTTTTTGGAGTAATTHHEFFFEGCFECCFFCHBFGHDFHADHGADECFF;GHBHHGCHD@CHB=FHD-H:GFF9EAGHBAGF=EAFBACECADGF?:FC@C@ACBCG=AGAEEBANM:i:0MD:Z:42MC:Z:81M19SAS:i:42XS:i:29SA:Z:Chr2,207068753,+,41S59M,0,0;Yt:Z:MUNEXT_SAMV350210693L1C001R001000008162113Chr2207068753041H59MChr31554813590CGCTTCGGCCGAGTGCCGAAGGCTGCTGTCATCAGCTGGTCCTCTTTTTTGGAGTAATTCHD@CHB=FHD-H:GFF9EAGHBAGF=EAFBACECADGF?:FC@C@ACBCG=AGAEEBANM:i:0MD:Z:59MC:Z:81M19HAS:i:59XS:i:59SA:Z:Chr3,59259905,+,42M58S,60,0;Yt:Z:MU V350210693L1C001R00100000816129Chr31554813596081M19S=59259905-96221455AGAGCAGGGATGGCAGAGGAGCAGTCCCAAAACCGGTTGAACCGGTTTCCGGACCGGTTGAACCGATTTCCAGGGCTGAAATTACTCCAAAAAAGAGGACCF9HGBFHA;>G@>/G#EF;GF;H=GBCC9==GDEF3?ED/EHH7<B>FD?A'F:<E,@G'$@EG8>.A4E)<CH58G#E"@@*E?3FAE#.B.C=)FCCNM:i:0MD:Z:81MC:Z:42M58SAS:i:81XS:i:50SA:Z:Chr1,67516115,+,77S23M,0,0;Yt:Z:MUNEXT_SAMV350210693L1C001R001000008162177Chr167516115077H23MChr3592599050GAAATTACTCCAAAAAAGAGGACG#E"@@*E?3FAE#.B.C=)FCCNM:i:0MD:Z:23MC:Z:42M58HAS:i:23XS:i:23SA:Z:Chr3,155481359,+,81M19S,60,0;Yt:Z:MU 0 60
V350210693L1C001R00100000903 Chr3 27240151 ! 0 + + UM V350210693L1C001R0010000090365Chr3272401515056M44S=25648197-1591955CTTTCTTATTACTAAGGTTTGGTATTTACTTACACTTAATGACTGCTAATAAAATTGTGTGTTGACGTTTGACAGAAATTGGGAGTCCAGTTTGCCATATDCB;G9@BABDHAECBGDBB?E7=C<BDGB?8GBGBD=9@G;FC=H?DCCED9DBA??HCF?C<ECG<77HAGFHCDDB=A;1-HAGG1?8A1GGDAD?DNM:i:0MD:Z:56MC:Z:100MAS:i:56XS:i:41SA:Z:Chr1,21357395,-,45M55S,0,0;Yt:Z:UMNEXT_SAMV350210693L1C001R001000009032129Chr121357395045M55HChr3256481970ATATGGCAAACTGGACTCCCAATTTCTGTCAAACGTCAACACACAD?DADGG1A8?1GGAH-1;A=BDDCHFGAH77<GCE<C?FCH??ANM:i:0MD:Z:45MC:Z:100MAS:i:45XS:i:45SA:Z:Chr3,27240151,+,56M44S,50,0;Yt:Z:UM V350210693L1C001R00100000903129Chr32564819751100M=272401511591955CATATAATGCTTCAAACGGTGACATCTTGATACTGGCCTAGTAGATGTTGTTATACGAGAACTCTGCATATGGCAAACTGGACTCCCAATTTCTGTCAAAED?D?D5<GEBAHEE1FHF=C2DC@>DBG6?AHCGHEHC@FC699?GD5F<A>C=6G/HB@<CACH:A?C9G==>B8GAE>B<?HEGE)C37?CHC7+E)NM:i:0MD:Z:100MC:Z:56M44SAS:i:100XS:i:80Yt:Z:UM 50 0
V350210693L1C001R00100000903 ! 0 Chr3 25648197 - + MU V350210693L1C001R0010000090365Chr3272401515056M44S=25648197-1591955CTTTCTTATTACTAAGGTTTGGTATTTACTTACACTTAATGACTGCTAATAAAATTGTGTGTTGACGTTTGACAGAAATTGGGAGTCCAGTTTGCCATATDCB;G9@BABDHAECBGDBB?E7=C<BDGB?8GBGBD=9@G;FC=H?DCCED9DBA??HCF?C<ECG<77HAGFHCDDB=A;1-HAGG1?8A1GGDAD?DNM:i:0MD:Z:56MC:Z:100MAS:i:56XS:i:41SA:Z:Chr1,21357395,-,45M55S,0,0;Yt:Z:MUNEXT_SAMV350210693L1C001R001000009032129Chr121357395045M55HChr3256481970ATATGGCAAACTGGACTCCCAATTTCTGTCAAACGTCAACACACAD?DADGG1A8?1GGAH-1;A=BDDCHFGAH77<GCE<C?FCH??ANM:i:0MD:Z:45MC:Z:100MAS:i:45XS:i:45SA:Z:Chr3,27240151,+,56M44S,50,0;Yt:Z:MU V350210693L1C001R00100000903129Chr32564819751100M=272401511591955CATATAATGCTTCAAACGGTGACATCTTGATACTGGCCTAGTAGATGTTGTTATACGAGAACTCTGCATATGGCAAACTGGACTCCCAATTTCTGTCAAAED?D?D5<GEBAHEE1FHF=C2DC@>DBG6?AHCGHEHC@FC699?GD5F<A>C=6G/HB@<CACH:A?C9G==>B8GAE>B<?HEGE)C37?CHC7+E)NM:i:0MD:Z:100MC:Z:56M44SAS:i:100XS:i:80Yt:Z:MU 0 51
And using 5unique
, I got:
V350210693L1C001R00100000816 Chr3 59259905 Chr3 155481359 + + uu V350210693L1C001R0010000081665Chr3592599056042M58S=15548135996221455GCACACAGTCATTCCCGTCCCACCTTCGAAATCGCCGTCCGCGCTTCGGCCGAGTGCCGAAGGCTGCTGTCATCAGCTGGTCCTCTTTTTTGGAGTAATTHHEFFFEGCFECCFFCHBFGHDFHADHGADECFF;GHBHHGCHD@CHB=FHD-H:GFF9EAGHBAGF=EAFBACECADGF?:FC@C@ACBCG=AGAEEBANM:i:0MD:Z:42MC:Z:81M19SAS:i:42XS:i:29SA:Z:Chr2,207068753,+,41S59M,0,0;Yt:Z:uuNEXT_SAMV350210693L1C001R001000008162113Chr2207068753041H59MChr31554813590CGCTTCGGCCGAGTGCCGAAGGCTGCTGTCATCAGCTGGTCCTCTTTTTTGGAGTAATTCHD@CHB=FHD-H:GFF9EAGHBAGF=EAFBACECADGF?:FC@C@ACBCG=AGAEEBANM:i:0MD:Z:59MC:Z:81M19HAS:i:59XS:i:59SA:Z:Chr3,59259905,+,42M58S,60,0;Yt:Z:uu V350210693L1C001R00100000816129Chr31554813596081M19S=59259905-96221455AGAGCAGGGATGGCAGAGGAGCAGTCCCAAAACCGGTTGAACCGGTTTCCGGACCGGTTGAACCGATTTCCAGGGCTGAAATTACTCCAAAAAAGAGGACCF9HGBFHA;>G@>/G#EF;GF;H=GBCC9==GDEF3?ED/EHH7<B>FD?A'F:<E,@G'$@EG8>.A4E)<CH58G#E"@@*E?3FAE#.B.C=)FCCNM:i:0MD:Z:81MC:Z:42M58SAS:i:81XS:i:50SA:Z:Chr1,67516115,+,77S23M,0,0;Yt:Z:uuNEXT_SAMV350210693L1C001R001000008162177Chr167516115077H23MChr3592599050GAAATTACTCCAAAAAAGAGGACG#E"@@*E?3FAE#.B.C=)FCCNM:i:0MD:Z:23MC:Z:42M58HAS:i:23XS:i:23SA:Z:Chr3,155481359,+,81M19S,60,0;Yt:Z:uu 60 60
V350210693L1C001R00100000903 Chr3 25648197 Chr3 27240151 + + Uu V350210693L1C001R00100000903129Chr32564819751100M=272401511591955CATATAATGCTTCAAACGGTGACATCTTGATACTGGCCTAGTAGATGTTGTTATACGAGAACTCTGCATATGGCAAACTGGACTCCCAATTTCTGTCAAAED?D?D5<GEBAHEE1FHF=C2DC@>DBG6?AHCGHEHC@FC699?GD5F<A>C=6G/HB@<CACH:A?C9G==>B8GAE>B<?HEGE)C37?CHC7+E)NM:i:0MD:Z:100MC:Z:56M44SAS:i:100XS:i:80Yt:Z:Uu V350210693L1C001R0010000090365Chr3272401515056M44S=25648197-1591955CTTTCTTATTACTAAGGTTTGGTATTTACTTACACTTAATGACTGCTAATAAAATTGTGTGTTGACGTTTGACAGAAATTGGGAGTCCAGTTTGCCATATDCB;G9@BABDHAECBGDBB?E7=C<BDGB?8GBGBD=9@G;FC=H?DCCED9DBA??HCF?C<ECG<77HAGFHCDDB=A;1-HAGG1?8A1GGDAD?DNM:i:0MD:Z:56MC:Z:100MAS:i:56XS:i:41SA:Z:Chr1,21357395,-,45M55S,0,0;Yt:Z:UuNEXT_SAMV350210693L1C001R001000009032129Chr121357395045M55HChr3256481970ATATGGCAAACTGGACTCCCAATTTCTGTCAAACGTCAACACACAD?DADGG1A8?1GGAH-1;A=BDDCHFGAH77<GCE<C?FCH??ANM:i:0MD:Z:45MC:Z:100MAS:i:45XS:i:45SA:Z:Chr3,27240151,+,56M44S,50,0;Yt:Z:Uu 51 50
I use version 1.0.3. I expected all
should contain all the pairs generated by 5unique
.
May be I missed anything? Do you have any suggestions for this complicated data set?
--extra-col-pair
Extra columns that also must match for two pairs to be marked as duplicates. Can be either provided as 0-based column indices or as column names (requires the “#columns” header field). The option can be provided multiple times if multiple column pairs must match. Example: –extra-col-pair “phase1” “phase2”. [output format option]
If I have added UMIs as a column, can I pass this into the extra-col-pair? I.e., pairtools dedup ... --extra-col-pair 8 ...